For a Japanese summary of this page, see telemetry.ja.md.
Iroha exports Prometheus-format metrics and a JSON status summary. This page lists key metrics and example PromQL queries you can use to build dashboards.
Endpoints
- `/metrics`: Prometheus exposition text. Hidden when telemetry is disabled or the profile does not allow expensive metrics.
- `/status`: JSON status (hidden when telemetry is disabled). Includes top-level gauges (peers, blocks, queue active count), a `crypto { sm_helpers_available, sm_openssl_preview_enabled, halo2: { enabled, curve, backend, max_k, verifier_budget_ms, verifier_max_batch } }` snapshot, the `sumeragi { leader_index, highest_qc_height, locked_qc_height, locked_qc_view, gossip_fallback_total, view_change_proof_accepted_total, view_change_proof_stale_total, view_change_proof_rejected_total, block_created_dropped_by_lock_total, block_created_hint_mismatch_total, block_created_proposal_mismatch_total, pacemaker_backpressure_deferrals_total, tx_queue_depth, tx_queue_capacity, tx_queue_saturated, epoch_length_blocks, epoch_commit_deadline_offset, epoch_reveal_deadline_offset, prf_epoch_seed (hex), prf_height, prf_view }` view (highest/locked QC heights), a `governance` snapshot, and (when available) `sorafs_micropayments`: the most recent SoraFS micropayment sample per provider, including credit counters and ticket totals.
- `/v1/sumeragi/new_view` (JSON): latest NEW_VIEW receipt counts per `(height, view)` (bounded in-memory window; oldest entries evicted).
- `/v1/sumeragi/new_view/sse` (SSE): periodic stream of the same JSON payload for live dashboards.
- `/v1/sumeragi/status` (Norito by default): consensus status snapshot.
Set `Accept: application/json` to receive a body of the following shape (highest/locked QC snapshots live in `highest_qc`/`locked_qc`):

```text
{ leader_index, view_change_index,
  highest_qc { height, view, subject_block_hash },
  locked_qc { height, view, subject_block_hash },
  commit_qc { height, view, epoch, block_hash, validator_set_hash, validator_set_len, signatures_total },
  commit_quorum { height, view, block_hash, signatures_present, signatures_counted, signatures_set_b, signatures_required, last_updated_ms },
  tx_queue { depth, capacity, saturated },
  epoch { length_blocks, commit_deadline_offset, reveal_deadline_offset },
  gossip_fallback_total, block_created_dropped_by_lock_total, block_created_hint_mismatch_total, block_created_proposal_mismatch_total,
  consensus_message_handling { entries: [{ kind, outcome, reason, total }] },
  pacemaker_backpressure_deferrals_total, da_reschedule_total,
  rbc_store { sessions, bytes, pressure_level, backpressure_deferrals_total, persist_drops_total, evictions_total, recent_evictions[...] },
  lane_activity: [{ lane_id, tx_vertices, tx_edges, overlay_count, overlay_instr_total, overlay_bytes_total, rbc_chunks, rbc_bytes_total }],
  dataspace_activity: [{ lane_id, dataspace_id, tx_served }],
  rbc_lane_backlog: [{ lane_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }],
  rbc_dataspace_backlog: [{ lane_id, dataspace_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }],
  lane_commitments: [{ block_height, lane_id, tx_count, total_chunks, rbc_bytes_total, teu_total, block_hash }],
  dataspace_commitments: [{ block_height, lane_id, dataspace_id, tx_count, total_chunks, rbc_bytes_total, teu_total, block_hash }],
  lane_governance: [{ lane_id, alias, dataspace_id, visibility, storage_profile, governance, manifest_required, manifest_ready, manifest_path, validator_ids, quorum, protected_namespaces, runtime_upgrade { allow, require_metadata, metadata_key, allowed_ids } }],
  lane_governance_sealed_total, lane_governance_sealed_aliases,
  prf { height, view, epoch_seed },
  vrf_penalty_epoch, vrf_committed_no_reveal_total, vrf_no_participation_total, vrf_late_reveals_total,
  collectors_targeted_{current,last_per_block}, redundant_sends_total,
  worker_loop { stage, stage_started_ms, last_iteration_ms,
    queue_depths { vote_rx, block_payload_rx, rbc_chunk_rx, block_rx, consensus_rx, lane_relay_rx, background_rx },
    queue_diagnostics { blocked_total { vote_rx, block_payload_rx, rbc_chunk_rx, block_rx, consensus_rx, lane_relay_rx, background_rx }, blocked_ms_total { ... }, blocked_max_ms { ... }, dropped_total { ... } } },
  commit_inflight { active, id, height, view, block_hash, started_ms, elapsed_ms, timeout_ms, timeout_total, last_timeout_timestamp_ms, last_timeout_elapsed_ms, last_timeout_height, last_timeout_view, last_timeout_block_hash, pause_total, resume_total, paused_since_ms, pause_queue_depths { ... }, resume_queue_depths { ... } },
  settlement {
    dvp { success_total, failure_total, final_state_totals { none|delivery_only|payment_only|both }, failure_reasons,
      last_event { observed_at_ms, settlement_id, plan { order, atomicity }, outcome, failure_reason, final_state, legs { delivery_committed, payment_committed } } },
    pvp { success_total, failure_total, final_state_totals { none|primary_only|counter_only|both }, failure_reasons,
      last_event { observed_at_ms, settlement_id, plan { order, atomicity }, outcome, failure_reason, final_state, legs { primary_committed, counter_committed }, fx_window_ms } } } }
```

- `/v1/sumeragi/status/sse` (SSE): periodic stream (≈1 s) emitting the same JSON payload as `/v1/sumeragi/status` for dashboards.
- When `nexus.enabled = false` (Iroha 2 mode), lane/dataspace sections in `/status` and `/v1/sumeragi/status` are emptied and Prometheus output omits lane/dataspace labels, so single-lane deployments stay lane-free.
- `/v1/sumeragi/rbc` (JSON): RBC session/throughput metrics: `{ sessions_active, sessions_pruned_total, ready_broadcasts_total, ready_rebroadcasts_skipped_total, deliver_broadcasts_total, payload_bytes_delivered_total, payload_rebroadcasts_skipped_total }`.
- `/v1/sumeragi/rbc/sessions` (JSON): RBC session snapshot: `{ sessions_active, items: [{ block_hash, height, view, total_chunks, received_chunks, ready_count, delivered, invalid, payload_hash, recovered, lane_backlog: [{ lane_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }], dataspace_backlog: [{ lane_id, dataspace_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }] }] }`.
- `/v1/sumeragi/pacemaker` (JSON): pacemaker timers and config: `{ backoff_ms, rtt_floor_ms, jitter_ms, backoff_multiplier, rtt_floor_multiplier, max_backoff_ms, jitter_frac_permille }`.
- `/v1/sumeragi/qc` (Norito by default): highest/locked QC snapshot; includes `subject_block_hash` for the highest QC when known. Set `Accept: application/json` to receive the JSON view.
- `/v1/sumeragi/commit_qc/{hash}` (Norito by default): full commit QC record for a block hash (if present). Set `Accept: application/json` to receive `{ subject_block_hash, commit_qc }` with `parent_state_root`, `post_state_root`, and aggregate signature data when available.
- `/v1/sumeragi/leader` (JSON): leader index snapshot; includes PRF context `{ height, view, epoch_seed }` in NPoS mode when available.
- `/v1/sumeragi/phases` (JSON): compact per-phase latencies (ms) for operator dashboards; returns the latest observed durations for consensus phases.
- `/v1/soranet/privacy/{event,share}` (Norito): privacy telemetry ingest for relay/collector signals. Requires `torii.soranet_privacy_ingest.enabled = true`, a token header (`X-SoraNet-Privacy-Token` or `X-API-Token`) when `require_token` is set, and a CIDR allow-list entry (an empty list denies).
Rate limits come from the same config (`rate_per_sec`/`burst`), and rejects surface `401`/`403`/`429` plus `soranet_privacy_ingest_reject_total{endpoint,reason}` counters for alerting.
- `/v1/sumeragi/collectors` (JSON): deterministic collector plan snapshot derived from the committed topology and on-chain parameters; exposes `mode`, `plan(height, view)` (where `height` mirrors the current chain height), `collectors_k`, `redundant_send_r`, `proxy_tail_index`, `min_votes_for_commit`, the ordered collector list, and `epoch_seed` (hex) when NPoS is active.
- `/v1/sumeragi/params` (JSON): snapshot of the on-chain Sumeragi parameters `{ block_time_ms, commit_time_ms, min_finality_ms, pacing_factor_bps, max_clock_drift_ms, collectors_k, redundant_send_r, da_enabled, next_mode, mode_activation_height, chain_height }`.
- `/v1/sumeragi/new_view/json` (JSON): NEW_VIEW receipt snapshot `{ ts_ms, items: [{height, view, count}] }` (bounded in-memory window; oldest entries evicted).
- Updated: also returns `locked_qc { height, view }`.
- Updated: also returns aggregate governance-seal counters (`lane_governance_sealed_total`, `lane_governance_sealed_aliases`) alongside the lane records. They provide a quick "are any lanes still sealed?" view in both `/v1/sumeragi/status` and `iroha_cli --output-format text ops sumeragi status`; the CLI prints the alias list inline so operators can reconcile outstanding manifests without diffing the full payload. Use `iroha_cli app nexus lane-report --only-missing --fail-on-sealed` during rollouts or CI to surface the same data with a non-zero exit when seals remain.
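The status fields above lend themselves to simple derived health checks in operator tooling. Here is a minimal Python sketch over an illustrative payload fragment; the field names follow the documentation above, but the values and helper names are made up for the example:

```python
# Sketch: evaluate a /v1/sumeragi/status-shaped payload for queue pressure
# and sealed lanes. `sample` is an illustrative fragment, not real node output.

def queue_pressure(status: dict) -> float:
    """Return tx queue utilisation in percent (0-100)."""
    q = status.get("tx_queue", {})
    capacity = q.get("capacity") or 0
    if capacity == 0:
        return 0.0
    return 100.0 * q.get("depth", 0) / capacity

def sealed_lane_aliases(status: dict) -> list[str]:
    """Lanes whose manifest is still required but not ready ('sealed')."""
    return [
        lane["alias"]
        for lane in status.get("lane_governance", [])
        if lane.get("manifest_required") and not lane.get("manifest_ready")
    ]

sample = {
    "tx_queue": {"depth": 750, "capacity": 1000, "saturated": False},
    "lane_governance": [
        {"alias": "payments", "manifest_required": True, "manifest_ready": True},
        {"alias": "research", "manifest_required": True, "manifest_ready": False},
    ],
}

print(queue_pressure(sample))        # 75.0
print(sealed_lane_aliases(sample))   # ['research']
```

In practice the payload would come from an HTTP GET against `/v1/sumeragi/status` with `Accept: application/json`.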
SM helper telemetry (Prometheus metrics)
- `iroha_sm_syscall_total{kind="hash|verify|seal|open",mode}`: cumulative SM helper syscall successes grouped by helper kind and mode (`gcm`/`ccm` for SM4 helpers).
- `iroha_sm_syscall_failures_total{kind,mode,reason}`: cumulative failure counts with reason labels (`permission_denied`, `norito_invalid`, `decode_error`, etc.).
Settlement telemetry
- `iroha_settlement_events_total{kind="dvp|pvp",outcome="success|failure",reason}`: settlement lifecycle counters labelled by instruction kind and failure reason (`insufficient_funds`, `counterparty_mismatch`, `unsupported_policy`, `zero_quantity`, `missing_entity`, `math_error`, `other`; success uses `reason="-"`).
- `iroha_settlement_finality_events_total{kind="dvp|pvp",outcome="success|failure",final_state="none|delivery_only|payment_only|both|primary_only|counter_only"}`: finality counters grouped by settlement kind, execution outcome, and which legs remained committed. DvP reports `delivery_only|payment_only`, PvP reports `primary_only|counter_only`; `none` means both legs rolled back.
- `iroha_settlement_fx_window_ms{kind="pvp",order,atomicity}`: histogram of observed PvP FX windows (milliseconds between committed legs) labelled by execution order (`delivery_then_payment`/`payment_then_delivery`) and atomicity policy (`all_or_nothing|commit_first_leg|commit_second_leg`).
Subscription telemetry
- `iroha_subscription_billing_attempts_total{pricing="fixed|usage"}`: billing trigger invocations grouped by pricing kind.
- `iroha_subscription_billing_outcomes_total{pricing="fixed|usage",result="paid|failed|suspended|skipped"}`: billing outcomes grouped by pricing kind and result label.
Network time telemetry
- `nts_offset_ms` (gauge): smoothed or raw offset vs the local clock.
- `nts_confidence_ms` (gauge): MAD confidence bound.
- `nts_peers_sampled` (gauge): peers contributing recent samples.
- `nts_samples_used` (gauge): samples used after RTT filtering.
- `nts_fallback` (gauge): 1 when NTS falls back to local time.
- `nts_healthy` (gauge): 1 when health thresholds pass and no fallback is active.
- `nts_min_samples_ok`/`nts_offset_ok`/`nts_confidence_ok` (gauges): per-check health flags.
- `nts_rtt_ms_bucket{le="..."}`/`nts_rtt_ms_sum`/`nts_rtt_ms_count`: RTT histogram buckets (ms) and aggregates.
- `torii_nts_unhealthy_reject_total` (counter): time-sensitive transactions rejected during admission because NTS is unhealthy.
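For intuition on the MAD confidence bound behind `nts_confidence_ms`, here is a hedged Python sketch of the statistic itself; the node's exact estimator (smoothing, scaling factors) is not documented here, so treat this as the textbook median-absolute-deviation only:

```python
# Sketch: median offset and MAD over peer offset samples, illustrating why a
# single outlier peer barely moves the confidence bound.
import statistics

def mad_confidence_ms(offsets_ms: list[float]) -> tuple[float, float]:
    """Return (median offset, median absolute deviation) in milliseconds."""
    med = statistics.median(offsets_ms)
    mad = statistics.median(abs(x - med) for x in offsets_ms)
    return med, mad

offsets = [2.0, 3.5, 2.5, 40.0, 3.0]   # one outlier peer at 40 ms
offset, confidence = mad_confidence_ms(offsets)
print(offset, confidence)  # 3.0 0.5
```

A mean/standard-deviation estimator over the same samples would be dragged far off by the 40 ms outlier, which is why a MAD-style bound suits RTT-filtered peer samples.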
Runbook guidance
- Alert when `max_over_time(nts_healthy[5m]) == 0` or `max_over_time(nts_fallback[5m]) > 0`; these indicate the time service is unsynchronized or missing samples.
- Use `nts_min_samples_ok`, `nts_offset_ok`, and `nts_confidence_ok` to pinpoint the root cause; check `/v1/time/status` for peer sample and RTT diagnostics.
- If `enforcement_mode = "reject"`, admission blocks time-sensitive instructions while unhealthy. Switch to `warn` only for temporary operational relief.
Configuration
- `telemetry_enabled` (default: `true`): master kill switch. When set to `false`, the daemon skips telemetry worker startup, Torii hides `/metrics` and `/status`, and runtime instrumentation is bypassed regardless of profile.
- `telemetry_profile` (default: `operator`): capability bundle wiring both Torii routing and runtime sinks. Profiles toggle three capability flags: `metrics`, `expensive_metrics`, and `developer_outputs`. When `telemetry_enabled = false`, the effective profile is forced to `disabled`.
- `torii.peer_telemetry_urls` (default: empty): optional list of Torii base URLs used to fetch peer telemetry metadata. When unset, peer telemetry discovery is disabled to avoid probing P2P ports.
- `torii.peer_geo.enabled` (default: `false`): enable peer geo lookups for Torii telemetry (opt-in; requires network access to the configured endpoint).
- `torii.peer_geo.endpoint` (default: unset): optional ip-api compatible endpoint used for peer geo lookups; when unset and `torii.peer_geo.enabled = true`, Torii uses the built-in ip-api default.
- Build-time ISI instrumentation: `#[metrics]` counters (`isi{kind="total|success"}`) and timing histograms (`isi_times`) require building `irohad` with `--features expensive-telemetry` (or `iroha_core` `expensive-telemetry`). The runtime still respects `telemetry_enabled` and `telemetry_profile` for exposure.
Telemetry redaction and integrity
- Redaction is mandatory for the `operator`, `extended`, and `full` profiles; startup rejects configs where `telemetry_redaction.mode` is not `strict` or the build lacks the `log-obfuscation` feature.
- Field-name normalization is case-insensitive and splits punctuation/camelCase/acronyms into snake_case segments for taxonomy and allow-list checks.
- Sensitive taxonomy is defined by explicit prefixes and keywords (kept in sync with the guardrails below).
- Redacted values are replaced with `[REDACTED]`; string payloads longer than 2048 bytes are truncated and suffixed with `...(truncated)` for deterministic export.
- `telemetry_redaction.mode` options: `strict` (always redact), `allowlist` (allow-listed entries bypass keyword redaction, but explicit prefixes still redact), `disabled` (developer-only). `telemetry_redaction.allowlist` must be a subset of the approved policy below.
Telemetry redaction prefixes:
redact
sensitive
secret
pii
Telemetry redaction keywords:
password
passwd
passphrase
secret
credential
token
access_token
refresh_token
session_token
session
authorization
cookie
jwt
bearer
api_key
api_key_hash
apikey
private_key
privkey
mnemonic
seed
Approved telemetry redaction allowlist (normalized field names):
(none)
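The normalization and taxonomy matching described above can be sketched in Python. The published prefixes and keywords are copied from the lists above; the exact acronym-splitting and segment-joining rules in the node are assumptions for illustration:

```python
# Sketch: normalize a field name into snake_case segments, then test it
# against the redaction prefixes and keywords. Acronym handling and the
# multi-segment join (e.g. api_key) are assumed details.
import re

PREFIXES = {"redact", "sensitive", "secret", "pii"}
KEYWORDS = {
    "password", "passwd", "passphrase", "secret", "credential", "token",
    "access_token", "refresh_token", "session_token", "session",
    "authorization", "cookie", "jwt", "bearer", "api_key", "api_key_hash",
    "apikey", "private_key", "privkey", "mnemonic", "seed",
}

def normalize(field: str) -> str:
    # Split camelCase/acronym boundaries, then fold punctuation to "_".
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "_", field)
    s = re.sub(r"[^A-Za-z0-9]+", "_", s)
    return s.strip("_").lower()

def should_redact(field: str) -> bool:
    name = normalize(field)
    if any(name.startswith(p) for p in PREFIXES):
        return True
    parts = name.split("_")
    # Match whole segments and short segment joins (e.g. api_key_hash).
    joined = {"_".join(parts[i:j]) for i in range(len(parts))
              for j in range(i + 1, min(i + 3, len(parts)) + 1)}
    return bool(joined & KEYWORDS)

print(normalize("APIKeyHash"))        # api_key_hash
print(should_redact("userPassword"))  # True
print(should_redact("block_height"))  # False
```

Because matching happens on normalized segments, `userPassword`, `user_password`, and `USER-PASSWORD` all hit the same `password` keyword.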
- Redaction audit metrics: `telemetry_redaction_total{reason="keyword|explicit"}`, `telemetry_redaction_skipped_total{reason="allowlist|disabled|unsupported"}`, and `telemetry_truncation_total`.
- Tamper-evident exports: when `telemetry_integrity.enabled = true`, websocket telemetry and dev-telemetry JSON lines include a `chain` object (`seq`, `prev_hash`, `hash`, optional `signature`, `key_id`). `hash` is `blake3(prev_hash || seq || payload_json)` where `payload_json` is the Norito JSON serialization of the record. `signature` is a keyed Blake3 hash when `telemetry_integrity.signing_key_hex` (32-byte hex) is set. Without a persisted state file, the chain restarts at sequence 1 on startup.
- To persist continuity across restarts, set `telemetry_integrity.state_dir` to a writable directory. Each sink writes its own state file (for example, `telemetry_integrity_ws.json` and `telemetry_integrity_dev.json`).
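The chain construction can be sketched in Python. The node hashes with blake3; since blake3 is not in the Python standard library, this sketch substitutes `hashlib.blake2b` purely to show the chaining structure, and the exact byte layout of `seq` is an assumption:

```python
# Sketch: tamper-evident record chain, hash over (prev_hash || seq || payload_json).
# blake2b stands in for the node's blake3; field layout is illustrative.
import hashlib
import json

def chain_records(payloads: list[dict], prev_hash: str = "00" * 32) -> list[dict]:
    out = []
    for seq, payload in enumerate(payloads, start=1):
        payload_json = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        h = hashlib.blake2b(digest_size=32)
        h.update(bytes.fromhex(prev_hash))
        h.update(seq.to_bytes(8, "little"))
        h.update(payload_json.encode())
        digest = h.hexdigest()
        out.append({"chain": {"seq": seq, "prev_hash": prev_hash, "hash": digest},
                    **payload})
        prev_hash = digest
    return out

records = chain_records([{"event": "commit", "height": 42},
                         {"event": "commit", "height": 43}])
# Each record's prev_hash must equal the previous record's hash.
assert records[1]["chain"]["prev_hash"] == records[0]["chain"]["hash"]
```

A verifier walks the export in order and recomputes each hash; any edited, dropped, or reordered record breaks the chain from that point on.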
Build-time instrumentation
- `iroha_core/expensive-telemetry` (enables `iroha_telemetry/metric-instrumentation`) compiles the `#[metrics]` attribute into Prometheus counters and timing histograms.
- Without it, `#[metrics(+"...")]` still parses but timing histograms are a no-op; the runtime `telemetry_profile` still controls exposure.
Profile capability matrix
| Profile | `/status` | `/metrics` | Developer routes (`/v1/sumeragi/*`, SSE) | Intended use |
|---|---|---|---|---|
| `disabled` | no | no | no | Telemetry fully off |
| `operator` | yes | no | no | Production nodes that only need JSON status |
| `extended` | yes | yes | no | Operators scraping Prometheus |
| `developer` | yes | no | yes | Local debugging and dashboards |
| `full` | yes | yes | yes | Combine operator + developer tooling |
Roadmap item NRPC/AND7 requires the Android SDK to follow the same telemetry and privacy guarantees as the Rust node. Use the artefacts below whenever you need to brief operators or confirm governance readiness:
| Artefact | Purpose |
|---|---|
| `sdk/android/telemetry_redaction.md` | Canonical policy covering hashed authorities, device buckets, retention, and override governance. |
| `android_runbook.md` | Step-by-step operational workflow (config threading, exporter checks, override handling). |
| `sdk/android/readiness/signal_inventory_worksheet.md` | Owner matrix for every Android span/event/metric plus validation evidence. |
| `sdk/android/telemetry_chaos_checklist.md` | Quarterly rehearsal scenarios referenced by SRE governance. |
| `sdk/android/readiness/and7_operator_enablement.md` | Curriculum outline and knowledge-check plan for support/on-call enablement. |
Key guardrails
- Android exports hashed authorities (`android.torii.http.request`/`android.torii.http.retry`) using the same Blake2b-256 salt published through `iroha_config.telemetry.redaction_salt`; watch `android.telemetry.redaction.salt_version` for drift during salt rotations.
- Device metadata is limited to coarse `android.telemetry.device_profile` buckets (SDK major version + `emulator|consumer|enterprise`). Alert when bucket ratios diverge by more than 10% from the Rust node baseline.
- Network context drops carrier names entirely; only `network_type` and `roaming` are exported. Any request for subscriber data should be rejected and handled through the override workflow.
- Overrides are logged via `android.telemetry.redaction.override` and must match the manifest + audit procedure documented in the Android Support Playbook; on-call engineers should update `docs/source/sdk/android/telemetry_override_log.md` immediately after apply and revoke operations.
Operational validation
- `scripts/telemetry/check_redaction_status.py` produces the status bundle attached to chaos drills and incident timelines; run it against staging before filing SRE readiness evidence.
- Chaos rehearsals record Grafana snapshots plus the latest `android.telemetry.redaction.failure` and `android.telemetry.redaction.override` counters; link those artefacts in `docs/source/sdk/android/readiness/labs/`.
- PromQL snippets: `increase(android.telemetry.redaction.failure_total[5m]) > 0` should page immediately outside chaos windows. `sum by (device_profile)(android.telemetry.device_profile)` compared against the Rust `node.hardware_profile` histogram verifies bucket alignment. `clamp_min(rate(android.telemetry.redaction.override_total[1h]), 0)` feeds the monthly override audit.
Refer to `android_support_playbook.md` for the escalation tree that ties these metrics back to pager rotations.
- `governance_proposals_status{status}` (gauge): current proposal counts grouped by status (`proposed`, `approved`, `rejected`, `enacted`). The `/status` JSON exposes the same data under `governance.proposals`, and the gauges are seeded from the recovered world state on startup, so they reflect persisted proposals even before new transitions occur.
- `governance_protected_namespace_total{outcome}` (counter): admission enforcement for protected namespaces. Increments with `outcome="allowed"` when a deployment is backed by an enacted proposal and `outcome="rejected"` when the gate blocks it.
- `governance_manifest_admission_total{result}` (counter): queue admission outcomes driven by lane manifests. Each `result` label captures a distinct path: `allowed`, `missing_manifest`, `non_validator_authority`, `quorum_rejected`, `protected_namespace_rejected`, `runtime_hook_rejected`.
- `governance_manifest_quorum_total{outcome}` (counter): manifest quorum checks on the validator set. `outcome="satisfied"` records admits that met the quorum, while `outcome="rejected"` flags submissions missing the required validator approvals.
- `governance_manifest_hook_total{hook, outcome}` (counter): governance hook enforcement decisions. Currently `hook="runtime_upgrade"` is emitted with outcomes `allowed`/`rejected` whenever runtime upgrade submissions pass or fail manifest policy.
- `governance_manifest_activations_total{event}` (counter): manifest lifecycle events emitted by `EnactReferendum`. `event="manifest_inserted"` counts new manifests keyed by `code_hash`; `event="instance_bound"` counts namespace bindings (contract instance activations).
- `/status` now includes a `governance` object with proposal counts, protected namespace totals, aggregated manifest admission outcomes, and a `recent_manifest_activations` array listing the most recent enactments (namespace, contract id, code/ABI hash, block height, and activation timestamp in milliseconds).
Example governance excerpt from `/status`:

```json
"governance": {
  "proposals": {
    "proposed": 2,
    "approved": 1,
    "rejected": 0,
    "enacted": 1
  },
  "protected_namespace": {
    "total_checks": 5,
    "allowed": 4,
    "rejected": 1
  },
  "manifest_admission": {
    "allowed": 4,
    "missing_manifest": 1,
    "non_validator_authority": 0,
    "quorum_rejected": 1,
    "protected_namespace_rejected": 2,
    "runtime_hook_rejected": 0
  },
  "manifest_quorum": {
    "satisfied": 3,
    "rejected": 1
  },
  "recent_manifest_activations": [
    {
      "namespace": "apps",
      "contract_id": "demo.contract",
      "code_hash_hex": "deadbeef",
      "abi_hash_hex": "cafebabe",
      "height": 42,
      "activated_at_ms": 1234567
    }
  ]
}
```

The governance surface enforces that only enacted manifests may bind contract instances inside protected namespaces. When rolling out runtime upgrades, operators should validate that manifests were activated and that subsequent deploys flowed through the protected gate.
Key dashboards
- Import `docs/source/grafana_governance_constraints.json` into Grafana (JSON dashboard). The template exposes the following panels:
  - Proposal Status Counts: shows the live values of `governance_proposals_status{status=…}` so proposal transitions can be reconciled with council decisions.
  - Protected Namespace Enforcement (5m): plots `increase(governance_protected_namespace_total{outcome=…}[5m])` to detect deploy attempts that were allowed or rejected after the upgrade.
  - Manifest Quorum Checks (5m): charts `increase(governance_manifest_quorum_total{outcome=…}[5m])` so missing quorum approvals surface quickly during incident response.
  - Manifest Admission Outcomes (5m): visualises `increase(governance_manifest_admission_total{result=…}[5m])`, breaking down successful admits versus each rejection path (missing manifest, validator mismatch, quorum, namespace, runtime hook).
  - Manifest Activations (5m): charts `increase(governance_manifest_activations_total{event=…}[5m])`, confirming that `EnactReferendum` inserted manifests and bound namespaces to the new `code_hash`.
  - Rejected Deploys (24h): a daily stat over `increase(governance_protected_namespace_total{outcome="rejected"}[24h])` highlighting unexpected admission failures.
Alert thresholds
- Protected namespaces: alert when `increase(governance_protected_namespace_total{outcome="rejected"}[15m]) > 0` outside of a planned rollback window.
- Manifest gate regressions: alert when `increase(governance_manifest_admission_total{result!="allowed"}[5m]) > 0` to catch sustained rejections (missing manifests, quorum failures, namespace or runtime-hook policy violations).
- Metric to watch: `torii_lane_admission_latency_seconds{lane_id,endpoint}` (histogram; 0.75 s P95 budget across transaction admission paths).
- Alert snippet: see `dashboards/alerts/soranet_lane_rules.yml`; the `SoranetLaneAdmissionLatencyDegraded` rule fires when the rolling 5-minute P95 exceeds 750 ms.
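For reference, the P95 that rule evaluates is Prometheus's `histogram_quantile`: linear interpolation inside the first bucket whose cumulative count covers the target rank. A simplified Python sketch with illustrative bucket bounds (the real query also handles the `+Inf` bucket and `rate()` windows):

```python
# Sketch: histogram_quantile over cumulative latency buckets (seconds).
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: cumulative (le_upper_bound, count) pairs, sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            span = count - lower_count
            frac = (rank - lower_count) / span if span else 0.0
            # Interpolate linearly inside the bucket that crosses the rank.
            return lower_bound + (upper - lower_bound) * frac
        lower_bound, lower_count = upper, count
    return buckets[-1][0]

# Cumulative admission-latency counts per `le` bound, illustrative values.
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 92.0), (0.75, 97.0), (1.0, 100.0)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3))  # 0.65 -> still inside the 0.75 s budget
```

Because the estimate interpolates within a bucket, P95 accuracy near the 0.75 s budget depends on having bucket bounds close to the budget itself.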
When the alert triggers:
- Inspect `histogram_quantile(0.5/0.95/0.99, sum by (lane_id,endpoint,le)(rate(torii_lane_admission_latency_seconds_bucket[5m])))` to confirm whether the regression is confined to a single endpoint (e.g. `/transaction` vs `/v1/contracts/instance/activate`).
- Pull `/v1/sumeragi/status` and review `lane_activity` for the affected lane. A spike in `tx_vertices`, `overlay_bytes_total`, or `rbc_bytes_total` hints at admission pressure rather than infrastructure issues.
- Check `rbc_lane_backlog` and `rbc_dataspace_backlog` in the same status payload. If backlog is accumulating, verify gossip health and DA fetch telemetry (`dashboards/grafana/soranet_pq_ratchet.json` covers PQ circuit status) before blaming the admission tier.
- Confirm Torii pacemaker windows by charting `sumeragi_pacemaker_backoff_ms` and `sumeragi_pacemaker_rtt_floor_ms`. A sudden increase usually indicates view-change churn that will also manifest as higher admission latency.
- If latency remains elevated after clearing backlog, throttle the offending lane by raising `iroha_config.torii.transaction_lane.max_inflight` or redirecting traffic to a healthy lane using the orchestrator/CLI routing helpers. Document changes in the operations notebook and revert once metrics stabilise.
Always capture a snapshot of the Grafana panel and the `/status` payload before and after remediation; include them in the incident timeline for audit parity.
Operators monitoring proposal health should import `docs/source/grafana_sumeragi_overview.json` into Grafana. The dashboard tracks:
- Highest vs Locked QC Height: gauges `sumeragi_highest_qc_height` and `sumeragi_locked_qc_height` so you can spot stalled view changes or peers lagging behind the canonical highest/locked certificates.
- Proposal Drop Rates (5m): visualises `increase(sumeragi_gossip_fallback_total[5m])` alongside the BlockCreated drop counters (`block_created_dropped_by_lock_total`, `block_created_hint_mismatch_total`, `block_created_proposal_mismatch_total`) to highlight misbehaving leaders and collectors.
- Proposal Drop Totals: a stat panel over the cumulative counters for quick summarisation in NOC dashboards.
Pair the panel with the alert snippets above (hint/proposal mismatch bursts) to trigger remediation workflows when drops exceed acceptable limits.
- Missing activations: alert when `increase(governance_manifest_activations_total{event="instance_bound"}[30m]) == 0` during an upgrade rollout window; a namespace binding never landed on-chain.
- Proposal drift: alert when `governance_proposals_status{status="proposed"}` remains non-zero for longer than the agreed SLA (e.g. 24 h) while `status="approved"` stays flat; council approvals are stuck in enactment.
Triage checklist
- Run `iroha_cli app gov deploy audit --namespace <ns>` (optionally filter by `--hash-prefix`) to compare stored manifests against the `code_hash` and `abi_hash` recorded in governance proposals. The command flags mismatches or missing manifests.
- Fetch `/status` and inspect `governance.recent_manifest_activations` to confirm the latest activation contains the expected `code_hash_hex` and `abi_hash_hex` at the upgrade height.
- Examine `increase(governance_protected_namespace_total{outcome="rejected"}[5m])`. If it spiked, grab Torii logs for the failing deploy and ensure the proposal was enacted; re-run `iroha_cli app gov protected get` to confirm the namespace list.
- Verify that `governance_proposals_status{status="approved"}` decreased while `status="enacted"` increased after the rollout. If counts drift, queue an `EnactReferendum` check and confirm the `iroha_cli app gov enact` automation ran.
- Inspect `increase(governance_manifest_hook_total{hook="runtime_upgrade", outcome="rejected"}[5m])` to spot runtime upgrade submissions blocked by manifest policy. Correlate spikes with Torii admission logs and confirm the manifest allowlist / metadata requirements match the proposal that triggered the hook.
Operators monitoring attachment throughput and the background prover should use
docs/source/zk/prover_runbook.md as their primary guide. It captures log
sources, alert thresholds, and mitigation steps for queue backlogs and budget
exhaustion.
Dashboards
- Import `docs/source/grafana_zk_prover.json` to get queue depth, latency, and budget panels. Update the Prometheus data source UID after import if your installation uses a custom name.
- Overlay `histogram_quantile(0.95, sum(rate(zk_verify_latency_ms_bucket[5m])) by (le, backend))` with `histogram_quantile(0.95, sum(rate(torii_zk_prover_latency_ms_bucket[5m])) by (le))` to spot systemic latency spikes.
Alert hints
- Backlog: `avg_over_time(torii_zk_prover_pending[10m]) > 0` (page).
- Budget hits: `increase(torii_zk_prover_budget_exhausted_total{reason="bytes"}[30m]) > 0` (ticket); escalate if the rate persists across multiple windows.
- Latency: `histogram_quantile(0.95, sum(rate(torii_zk_prover_latency_ms_bucket[5m])) by (le))` exceeding `torii.zk_prover_max_scan_millis` for more than 15 minutes.
- Sanitizer rejects: `increase(torii_attachment_reject_total[10m]) > 0` to catch unsupported types, expansion limits, or malformed payloads; correlate with `histogram_quantile(0.95, sum(rate(torii_attachment_sanitize_ms_bucket[5m])) by (le))`.
Triage outline
- Inspect `torii_zk_prover_pending`, `torii_zk_prover_inflight`, and `torii_zk_prover_last_scan_ms` to understand queue pressure.
- Review Torii logs (target `torii::zk_prover`) for scan summaries and budget hits. Enable debug logging temporarily when attachment ids are required.
- Confirm configuration values in `iroha_config` (`[torii] zk_prover_*`).
- Prune or retry problematic attachments with `iroha_cli app zk attachments delete` when backlog cleanup is required.
- Document any threshold changes in the ops notebook and update Grafana panel annotations.
Torii integration
- `Torii::new_with_handle` accepts a `routing::MaybeTelemetry` gate that pairs the runtime `Telemetry` handle with the active `TelemetryProfile`. Use `routing::MaybeTelemetry::from_profile(runtime_handle, profile)` to construct the gate, or `routing::MaybeTelemetry::disabled()` when telemetry is unavailable.
- `Torii::new` (when the `telemetry` feature is enabled) remains as a convenience wrapper; it now forwards to `new_with_handle` with an operator profile by default. Tests can use `routing::MaybeTelemetry::for_tests()` to obtain an in-process telemetry handle.
- `torii_address_invalid_total{endpoint,reason}` increments whenever HTTP routes reject an account identifier (invalid I105 payloads, domain mismatches, etc.). Keep the `<0.1%` SLO by watching the dedicated Grafana board in `dashboards/grafana/address_ingest.json`.
- `torii_address_collision_total{endpoint,kind="local12_digest"}` and `torii_address_collision_domain_total{endpoint,domain}` record Local-12 selector collisions. Both feed the collision panel/alert in `dashboards/grafana/address_ingest.json` so operators can tie spikes to specific domains. Production should stay flat; any increment blocks manifest promotions until governance signs off on the fix.
Pipeline metrics
- `pipeline_stage_ms`: histogram of per-stage durations with label `stage` in `{"access","overlays","dag","schedule","apply","layers_prep","layers_exec","layers_merge"}`.
- `pipeline_dag_vertices`, `pipeline_dag_edges`, `pipeline_conflict_rate_bps`: latest validated block DAG shape and conflict rate in basis points.
- `pipeline_access_set_source_total{source=manifest_hints|entrypoint_hints|prepass_merge|conservative_fallback}`: cumulative access-set derivation counts by source.
- `pipeline_overlay_count`, `pipeline_overlay_instructions`, `pipeline_overlay_bytes`: overlay stats for the latest block.
- `pipeline_peak_layer_width`, `pipeline_layer_avg_width`, `pipeline_layer_median_width`: layer width summary.
- `pipeline_layer_count`: number of scheduler layers for the latest block.
- `pipeline_scheduler_utilization_pct`: average parallelism utilization (0..100) computed as `avg_width / peak_width * 100`.
- `pipeline_detached_prepared`, `pipeline_detached_merged`, `pipeline_detached_fallback`: detached execution counters per latest block.
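The utilization formula is easy to check by hand. A tiny Python sketch with illustrative per-layer widths:

```python
# Sketch: pipeline_scheduler_utilization_pct = avg_width / peak_width * 100,
# computed from the width of each scheduler layer.
def scheduler_utilization_pct(layer_widths: list[int]) -> float:
    peak = max(layer_widths)
    avg = sum(layer_widths) / len(layer_widths)
    return avg / peak * 100.0

# Four scheduler layers: one wide pair caps the achievable parallelism.
print(scheduler_utilization_pct([8, 8, 4, 4]))  # 75.0
```

A low percentage with a high `pipeline_peak_layer_width` means a few wide layers dominate while most layers run narrow, which is a scheduling-shape problem rather than a raw throughput problem.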
IVM cache metrics
- `ivm_cache_hits`, `ivm_cache_misses`, `ivm_cache_evictions`: global pre-decode cache counters (cumulative).
- `ivm_cache_decoded_streams`, `ivm_cache_decoded_ops_total`: cumulative decode workload counters (number of streams and total ops) reported by the IVM pre-decode cache.
- `ivm_cache_decode_failures`, `ivm_cache_decode_time_ns_total`: cumulative decode failure count and wall-clock nanoseconds spent decoding.
IVM register pressure metrics (new)
- `ivm_register_max_index` (histogram): highest GPR index touched during a VM execution. Buckets cover 16 → 512 so you can alert when contracts spill beyond the hot register bank.
- `ivm_register_unique_count` (histogram): number of distinct GPRs accessed per execution. Useful to spot contracts with high register churn before Merkle tagging becomes expensive.
- Example PromQL: `histogram_quantile(0.9, sum(rate(ivm_register_max_index_bucket[5m])) by (le))` gives the p90 of the highest register index over the last 5 minutes; `histogram_quantile(0.95, sum(rate(ivm_register_unique_count_bucket[5m])) by (le))` gives the p95 of unique registers touched.
- Alert hint: page when the p95 `histogram_quantile` holds above ~256 for more than 10 minutes; this signals workloads exceeding the intended fast-register tier and justifies enabling the tiered register RFC.
Block/consensus metrics
- commit_time_ms (histogram), last_commit_time_ms (gauge), block_height, block_height_non_empty, txs{type in [accepted,rejected,total]}.
- Queue gauges: queue_size (active queue size, queued + in-flight), queue_queued (waiting in the hash queue), queue_inflight (selected but not yet committed).
P2P metrics (selected)
- connected_peers, p2p_peer_churn_total{event="connected|disconnected"}, and p2p_* gauges/counters for queue depth/drops, throttling, DNS, and handshake latencies (p2p_handshake_ms_*).
- consensus_ingress_drop_total{topic,reason} counts consensus ingress drops for payload topics (topic in ConsensusPayload|ConsensusChunk|BlockSync, reason in rate|bytes|rbc_session_limit|penalty).
Sumeragi metrics
- Counters: sumeragi_tail_votes_total, sumeragi_widen_before_rotate_total, sumeragi_view_change_suggest_total, sumeragi_view_change_install_total; histogram: sumeragi_cert_size (signatures per committed block).
- Commit quorum/certificate: sumeragi_commit_signatures_present, sumeragi_commit_signatures_counted, sumeragi_commit_signatures_set_b, and sumeragi_commit_signatures_required track the last commit tally; sumeragi_commit_certificate_height, sumeragi_commit_certificate_view, sumeragi_commit_certificate_epoch, sumeragi_commit_certificate_signatures_total, and sumeragi_commit_certificate_validator_set_len summarize the latest commit certificate.
- Queue health: sumeragi_tx_queue_depth / sumeragi_tx_queue_capacity gauge the live mempool size and effective ceiling; sumeragi_tx_queue_saturated flips to 1 when Torii reports saturation, signalling that redundant collector fan-out is temporarily suppressed.
- Pending blocks: sumeragi_pending_blocks_total counts pending blocks tracked by the local node; sumeragi_pending_blocks_blocking isolates those that gate proposal/view-change progress; sumeragi_commit_inflight_queue_depth shows whether the commit pipeline is busy (0/1).
- Proposal gaps: sumeragi_proposal_gap_total counts view-change rotations triggered because no proposal was observed before the cutoff.
- VRF emission: sumeragi_vrf_commits_emitted_total, sumeragi_vrf_reveals_emitted_total, and sumeragi_vrf_reveals_late_total count how many commit/reveal messages this validator broadcast (including late reveals accepted after the window). Pair with the sumeragi_vrf_non_reveal_* counters to monitor participation health at epoch boundaries.
- Collector fan-out: sumeragi_redundant_sends_total (aggregate), sumeragi_redundant_sends_by_peer{peer="…"}, and sumeragi_redundant_sends_by_collector{idx="…"} highlight redundant collector sends; investigate sustained spikes to locate congested collectors or unhealthy peers.
- Collector targeting: sumeragi_collectors_targeted_current (gauge) tracks the in-flight collector count for the current block; the sumeragi_collectors_targeted_per_block histogram (*_bucket) records how many collectors were targeted per committed block.
- DA availability warnings: sumeragi_rbc_da_reschedule_total (and /v1/sumeragi/status → da_reschedule_total) is legacy and no longer increments; use sumeragi_da_gate_block_total{reason="missing_local_data"} for missing local payloads.
- Channel pressure: sumeragi_dropped_block_messages_total and sumeragi_dropped_control_messages_total partition channel drops; dropped_messages remains the aggregate counter for existing dashboards.
Sumeragi additions (new series)
- sumeragi_highest_qc_height (gauge) — current adopted highest QC height.
- sumeragi_new_view_publish_total (counter) — NEW_VIEW messages published by this node.
- sumeragi_new_view_recv_total (counter) — NEW_VIEW messages received and accepted by this node.
- See also: sumeragi_new_view_receipts_by_hv{height="<h>",view="<v>"} for per-(height, view) receipt counts.
- sumeragi_post_to_peer_total{peer} (counter) — post attempts to peers (collector routing and backpressure insight).
- sumeragi_bg_post_enqueued_total{kind} (counter) — background-post tasks enqueued by kind in {Post,Broadcast}.
- sumeragi_bg_post_overflow_total{kind} (counter) — background-post queue full events; the sender blocks until space is available.
- sumeragi_bg_post_drop_total{kind} (counter) — background-post drops when the queue is missing or disconnected.
- sumeragi_bg_post_queue_depth (gauge) — global background-post queue depth.
- sumeragi_bg_post_queue_depth_by_peer{peer} (gauge) — per-collector background-post queue depth.
Sumeragi pacemaker
- Config gauges:
  - sumeragi_pacemaker_backoff_multiplier — backoff multiplier applied to each view-change increment.
  - sumeragi_pacemaker_rtt_floor_multiplier — multiplier for the RTT-based floor.
  - sumeragi_pacemaker_max_backoff_ms — maximum backoff cap applied to the pacemaker window.
  - sumeragi_pacemaker_jitter_frac_permille — jitter band as permille of the window (0..=1000).
- Runtime gauges:
  - sumeragi_pacemaker_backoff_ms — current backoff window (ms) that gates the next view-change suggestion.
  - sumeragi_pacemaker_rtt_floor_ms — current RTT-based floor (ms) considered when computing the backoff window; 0 when there are no RTT samples.
  - sumeragi_pacemaker_jitter_ms — jitter magnitude applied (ms; absolute value).
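The gauges above suggest how the window is assembled from the config values. The sketch below is an assumption-based reading, not the actual Sumeragi implementation: the base timeout, the exponential backoff shape, and the floor/cap ordering are all illustrative.

```python
import random

# Hypothetical reconstruction of the pacemaker window from its gauges.
# base_ms, the power-of-multiplier backoff, and floor-before-cap ordering
# are assumptions for illustration.
def backoff_window_ms(base_ms, view_change_index, backoff_multiplier,
                      avg_rtt_ms, rtt_floor_multiplier, max_backoff_ms,
                      jitter_frac_permille, rng=None):
    rng = rng or random.Random(0)
    window = base_ms * (backoff_multiplier ** view_change_index)
    rtt_floor = avg_rtt_ms * rtt_floor_multiplier      # sumeragi_pacemaker_rtt_floor_ms
    window = min(max(window, rtt_floor), max_backoff_ms)
    jitter = window * jitter_frac_permille / 1000.0    # band as permille of the window
    return window + rng.uniform(-jitter, jitter)

print(backoff_window_ms(250, 1, 2, 60, 2, 60_000, 0))  # 500.0 (no jitter configured)
```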
PromQL examples
- Pacemaker backoff trend (avg over 5m):
- avg_over_time(sumeragi_pacemaker_backoff_ms[5m])
- RTT floor trend (avg over 5m):
- avg_over_time(sumeragi_pacemaker_rtt_floor_ms[5m])
- Verify config:
- max(sumeragi_pacemaker_backoff_multiplier)
- max(sumeragi_pacemaker_rtt_floor_multiplier)
- max(sumeragi_pacemaker_max_backoff_ms)
- max(sumeragi_pacemaker_jitter_frac_permille)
NEW_VIEW receipts
- GaugeVec: sumeragi_new_view_receipts_by_hv{height="<h>",view="<v>"} — deduplicated NEW_VIEW sender count for (height, view).
- Counter: sumeragi_new_view_dropped_by_lock_total — NEW_VIEW frames rejected because the advertised highest certificate is behind the current locked certificate.
- Example queries:
  - Latest counts across recent heights: sum by (height,view) (sumeragi_new_view_receipts_by_hv)
  - Filter for current height h: sumeragi_new_view_receipts_by_hv{height="<h>"}
- Operator endpoints:
  - JSON snapshot: GET /v1/sumeragi/new_view → { ts_ms, items: [{height,view,count}, ...] }
  - SSE stream: GET /v1/sumeragi/new_view/sse (1s interval) emits the same structure per event.
  - Note: counts are kept in a bounded in-memory window; the oldest (height, view) entries are evicted.
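The bounded receipt window behaves roughly like an insertion-ordered map with oldest-entry eviction. A sketch (the capacity value and per-sender deduplication details are assumptions for illustration):

```python
from collections import OrderedDict

# Sketch of the bounded NEW_VIEW receipt window: per-(height, view) sender
# sets with oldest-entry eviction once capacity is exceeded.
class NewViewWindow:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.receipts = OrderedDict()  # (height, view) -> set of sender ids

    def record(self, height, view, sender):
        senders = self.receipts.setdefault((height, view), set())
        senders.add(sender)  # deduplicated sender count
        while len(self.receipts) > self.capacity:
            self.receipts.popitem(last=False)  # evict oldest (height, view)

    def snapshot(self):
        return [{"height": h, "view": v, "count": len(s)}
                for (h, v), s in self.receipts.items()]

w = NewViewWindow(capacity=2)
w.record(10, 0, "peer-a"); w.record(10, 0, "peer-a"); w.record(10, 0, "peer-b")
w.record(11, 0, "peer-a"); w.record(12, 0, "peer-c")  # evicts (10, 0)
print(w.snapshot())  # only the two newest (height, view) entries remain
```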
Example PromQL
- P50/P90 stage latency (ms):
- histogram_quantile(0.5, sum(rate(pipeline_stage_ms_bucket[5m])) by (le,stage))
- histogram_quantile(0.9, sum(rate(pipeline_stage_ms_bucket[5m])) by (le,stage))
- Commit time P95 (ms):
- histogram_quantile(0.95, sum(rate(commit_time_ms_bucket[5m])) by (le))
- IVM cache hit rate (%):
- 100 * (ivm_cache_hits - ivm_cache_hits offset 5m) / clamp_min((ivm_cache_hits - ivm_cache_hits offset 5m) + (ivm_cache_misses - ivm_cache_misses offset 5m), 1)
- Detached merge ratio:
- pipeline_detached_merged / clamp_min(pipeline_detached_prepared, 1)
- Sumeragi tail votes rate (s⁻¹):
- rate(sumeragi_tail_votes_total[5m])
- Widen-before-rotate rate (s⁻¹):
- rate(sumeragi_widen_before_rotate_total[5m])
- View-change suggests vs installs (s⁻¹):
- rate(sumeragi_view_change_suggest_total[5m])
- rate(sumeragi_view_change_install_total[5m])
- Certificate size P90 (signatures):
- histogram_quantile(0.9, sum(rate(sumeragi_cert_size_bucket[5m])) by (le))
- Collector targeting distribution:
- sum by (le) (rate(sumeragi_collectors_targeted_per_block_bucket[5m]))
- histogram_quantile(0.95, sum by (le) (rate(sumeragi_collectors_targeted_per_block_bucket[5m])))
- Redundant send spikes / top offenders:
- rate(sumeragi_redundant_sends_total[5m])
- topk(5, sum by (peer) (rate(sumeragi_redundant_sends_by_peer[5m])))
- sum by (idx) (rate(sumeragi_redundant_sends_by_collector[5m]))
- Channel drop alerts:
- rate(sumeragi_dropped_block_messages_total[5m])
- rate(sumeragi_dropped_control_messages_total[5m])
- rate(dropped_messages[5m])
Sumeragi phases latencies (operator dashboards)
- Endpoint: GET /v1/sumeragi/phases (JSON)
- Shape: { propose_ms, collect_da_ms, collect_prevote_ms, collect_precommit_ms, collect_aggregator_ms, commit_ms, pipeline_total_ms, ema_ms } where ema_ms mirrors the phase keys (propose_ms, …, collect_aggregator_ms, commit_ms, pipeline_total_ms).
- Purpose: a quick, compact snapshot of the latest observed durations (milliseconds) for each consensus phase, to power lightweight dashboards.
- collect_aggregator_ms tracks redundant collector fan-out latency (validator → secondary collectors). Pair it with the sumeragi_redundant_sends_* counters when tuning K/r parameters or alert thresholds. Gossip fallback frequency surfaces via sumeragi_gossip_fallback_total, and proposal drops caused by the locked QC gate are exported as block_created_dropped_by_lock_total. Header rejections are split between block_created_hint_mismatch_total (height/view/parent mismatches) and block_created_proposal_mismatch_total (proposal header/payload mismatches that emit InvalidProposal evidence).
- pipeline_total_ms sums the pacemaker-controlled phases (propose, collect_da, collect_prevote, collect_precommit, commit) to provide a single end-to-end latency figure; collect_aggregator_ms remains a separate fan-out signal and is not included.
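The relationship between the phase fields and pipeline_total_ms can be reproduced directly from a /v1/sumeragi/phases snapshot:

```python
# pipeline_total_ms sums the pacemaker-controlled phases and deliberately
# excludes collect_aggregator_ms (a separate fan-out signal).
PACED_PHASES = ("propose_ms", "collect_da_ms", "collect_prevote_ms",
                "collect_precommit_ms", "commit_ms")

def pipeline_total_ms(phases):
    return sum(phases[k] for k in PACED_PHASES)

sample = {"propose_ms": 11, "collect_da_ms": 22, "collect_prevote_ms": 33,
          "collect_precommit_ms": 44, "collect_aggregator_ms": 50,
          "commit_ms": 77}
print(pipeline_total_ms(sample))  # 187 — matches the example response below
```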
Example response
{
"propose_ms": 11,
"collect_da_ms": 22,
"collect_prevote_ms": 33,
"collect_precommit_ms": 44,
"collect_aggregator_ms": 50,
"commit_ms": 77,
"pipeline_total_ms": 187,
"collect_aggregator_gossip_total": 3,
"block_created_dropped_by_lock_total": 1,
"block_created_hint_mismatch_total": 2,
"block_created_proposal_mismatch_total": 4,
"ema_ms": {
"propose_ms": 15,
"collect_da_ms": 26,
"collect_prevote_ms": 37,
"collect_precommit_ms": 48,
"collect_aggregator_ms": 57,
"commit_ms": 81,
"pipeline_total_ms": 207
}
}

Alert snippets
- Hint mismatch burst: increase(block_created_hint_mismatch_total[5m]) > 0
- Proposal mismatch burst: increase(block_created_proposal_mismatch_total[5m]) > 0
- Locked QC gate drop spike: increase(block_created_dropped_by_lock_total[5m]) > 0
- Pacemaker deferrals under sustained load: increase(sumeragi_pacemaker_backpressure_deferrals_total[5m]) > 0
- Pacemaker deferrals by reason: increase(sumeragi_pacemaker_backpressure_deferrals_by_reason_total[5m]) > 0
Sumeragi pacemaker (example)
- Endpoint: GET /v1/sumeragi/pacemaker
- Shape: { backoff_ms, rtt_floor_ms, jitter_ms, backoff_multiplier, rtt_floor_multiplier, max_backoff_ms, jitter_frac_permille, round_elapsed_ms, view_timeout_target_ms, view_timeout_remaining_ms }
Example response
{
"backoff_ms": 500,
"rtt_floor_ms": 120,
"jitter_ms": 15,
"backoff_multiplier": 2,
"rtt_floor_multiplier": 2,
"max_backoff_ms": 60000,
"jitter_frac_permille": 50,
"round_elapsed_ms": 340,
"view_timeout_target_ms": 1000,
"view_timeout_remaining_ms": 660
}

Sumeragi QC snapshot (example)
- Endpoint: GET /v1/sumeragi/qc
- Shape: { highest_qc: { height, view, subject_block_hash }, locked_qc: { height, view } }
Example response
{
"highest_qc": {
"height": 1234,
"view": 7,
"subject_block_hash": "9f1b0c7b59f1e2a3d4c5b6a79800112233445566778899aabbccddeeff001122"
},
"locked_qc": { "height": 1229, "view": 6 }
}

Prometheus exports matching gauges for these snapshots: sumeragi_highest_qc_height, sumeragi_locked_qc_height, and sumeragi_locked_qc_view.
- fraud_psp_assessments_total{tenant,band,lane,subnet} — counter incremented whenever Torii admits a PSP assessment. Use it to monitor tenant activity and severity distribution.
- fraud_psp_attestation_total{tenant,engine,lane,subnet,status} — attestation verifier outcomes (status="verified" on success). Alerts should trigger on non-verified statuses to catch PSP regressions or key rotation issues.
- fraud_psp_missing_assessment_total{tenant,lane,subnet,cause} — counter for transactions missing metadata. cause="missing" means the host rejected the transaction; cause="grace" denotes temporary bypass via missing_assessment_grace_secs.
- fraud_psp_invalid_metadata_total{tenant,field,lane,subnet} — counter tracking malformed metadata fields (e.g., missing tenant, non-numeric latency).
- fraud_psp_latency_ms{tenant,lane,subnet} — histogram of PSP-reported scoring latency in milliseconds; buckets follow an exponential series (5 ms … ~1.3 s).
- fraud_psp_score_bps{tenant,band,lane,subnet} — histogram of risk scores (0–10 000 bps) recorded after admission.
- fraud_psp_outcome_mismatch_total{tenant,direction,lane,subnet} — counter capturing outcome drift: direction="missed_fraud" when a confirmed fraud cleared with band ∈ {low,medium}; direction="false_positive" when a band ∈ {high,critical} decision resolved as non-fraud.
PromQL starters:
- Tenant heartbeat: sum by (tenant) (rate(fraud_psp_assessments_total[5m]))
- Latency P95 per tenant: histogram_quantile(0.95, sum(rate(fraud_psp_latency_ms_bucket[10m])) by (le,tenant))
- Grace-window watchdog: sum(rate(fraud_psp_missing_assessment_total{cause="grace"}[10m]))
- Mismatch ratio: sum(rate(fraud_psp_outcome_mismatch_total{direction="missed_fraud"}[1h])) / clamp_min(sum(rate(fraud_psp_assessments_total[1h])), 1)
Telemetry expects the following transaction metadata to be present when fraud monitoring is enabled: fraud_assessment_band, fraud_assessment_tenant, fraud_assessment_score_bps, fraud_assessment_latency_ms, and, once PSPs complete post-incident triage, fraud_assessment_disposition (values documented in docs/source/fraud_monitoring_system.md).
Sumeragi leader (example)
- Endpoint: GET /v1/sumeragi/leader
- Shape: { leader_index, prf: { height, view, epoch_seed } }
Example response
{
"leader_index": 3,
"prf": {
"height": 1234,
"view": 7,
"epoch_seed": "c0ffee1234567890deadbeef00112233445566778899aabbccddeeff00112233"
}
}

Sumeragi RBC (status example)
- Endpoint: GET /v1/sumeragi/rbc
- Shape: { sessions_active, sessions_pruned_total, ready_broadcasts_total, ready_rebroadcasts_skipped_total, deliver_broadcasts_total, payload_bytes_delivered_total, payload_rebroadcasts_skipped_total }
Example response
{
"sessions_active": 2,
"sessions_pruned_total": 10,
"ready_broadcasts_total": 8,
"ready_rebroadcasts_skipped_total": 3,
"deliver_broadcasts_total": 7,
"payload_bytes_delivered_total": 1234567,
"payload_rebroadcasts_skipped_total": 5
}

Sumeragi RBC sessions (example)
- Endpoint: GET /v1/sumeragi/rbc/sessions
- Shape: { sessions_active, items: [{ block_hash, height, view, total_chunks, received_chunks, ready_count, delivered, invalid, payload_hash, recovered, lane_backlog: [{ lane_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }], dataspace_backlog: [{ lane_id, dataspace_id, tx_count, total_chunks, pending_chunks, rbc_bytes_total }] }] }
Example response
{
"sessions_active": 1,
"items": [
{
"block_hash": "7a6f2d3c4b5a9e8d7c6b5a4c3d2e1f0a11223344556677889900aabbccddeeff",
"height": 1234,
"view": 7,
"total_chunks": 12,
"received_chunks": 12,
"ready_count": 5,
"delivered": true,
"invalid": false,
"payload_hash": "f1e2d3c4b5a697887766554433221100ffeeddccbbaa00998877665544332211",
"recovered": true,
"lane_backlog": [
{
"lane_id": 0,
"tx_count": 6,
"total_chunks": 12,
"pending_chunks": 0,
"rbc_bytes_total": 786432
}
],
"dataspace_backlog": [
{
"lane_id": 0,
"dataspace_id": 0,
"tx_count": 6,
"total_chunks": 12,
"pending_chunks": 0,
"rbc_bytes_total": 786432
}
]
}
]
}

Sumeragi telemetry snapshot
- Endpoint: GET /v1/sumeragi/telemetry
- Shape: { availability: { total_votes_ingested, collectors: [{ collector_idx, peer_id, votes_ingested }] }, qc_latency_ms: [{ kind, last_ms }], rbc_backlog: { pending_sessions, total_missing_chunks, max_missing_chunks } }
Example response
{
"availability": {
"total_votes_ingested": 12,
"collectors": [
{
"collector_idx": 4,
"peer_id": "ed0120...",
"votes_ingested": 5
}
]
},
"qc_latency_ms": [
{
"kind": "availability",
"last_ms": 138
}
],
"rbc_backlog": {
"pending_sessions": 1,
"total_missing_chunks": 3,
"max_missing_chunks": 2
}
}

Layer widths and utilization
- Peak width per block: max_over_time(pipeline_peak_layer_width[5m])
- Average width trend: avg_over_time(pipeline_layer_avg_width[5m])
- Median width trend: avg_over_time(pipeline_layer_median_width[5m])
- Layer count trend: avg_over_time(pipeline_layer_count[5m])
- Scheduler utilization (P50): quantile_over_time(0.5, pipeline_scheduler_utilization_pct[10m])
- Width histogram (e.g., layers <= 8): sum by (le) (increase(pipeline_layer_width_hist_bucket{le="8"}[5m]))
- Overlay volume (instructions/bytes):
- rate(pipeline_overlay_instructions[5m])
- rate(pipeline_overlay_bytes[5m])
Configuration
- Telemetry master switch: when disabled in iroha_config, Torii hides /metrics and /status, no telemetry outputs are started, and all observations are skipped. Gauges/counters remain unchanged while disabled.
- Halo2 verifier gauges:
  - iroha_zk_halo2_enabled: 0/1 flag indicating whether Halo2 verification is active.
  - iroha_zk_halo2_curve_id: numeric identifier for the selected curve (0=Pallas, 1=Pasta, 2=Goldilocks, 3=Bn254).
  - iroha_zk_halo2_backend_id: numeric identifier for the backend (0=IPA, 1=Unsupported).
  - iroha_zk_halo2_max_k: maximum supported circuit exponent (N = 2^k).
  - iroha_zk_halo2_verifier_budget_ms: soft verifier time budget per proof (milliseconds).
  - iroha_zk_halo2_verifier_max_batch: maximum proofs accepted in one batch verification.
DA/RBC (Sumeragi) configuration
- sumeragi.da.enabled (bool): enables data availability tracking and Reliable Broadcast (RBC) payload distribution together. Availability evidence (an availability-evidence or RBC READY quorum) is tracked (advisory; it does not gate commit); missing local payloads are fetched via RBC or block sync, and RBC remains a transport/recovery mechanism whose delivery latency is still tracked.
- sumeragi.advanced.rbc.chunk_max_bytes (usize): maximum bytes per RBC chunk when broadcasting payloads; must be > 0. Clamped at startup so serialized RBC chunks fit within the consensus payload plaintext cap derived from network.max_frame_bytes_block_sync.
- sumeragi.advanced.rbc.session_ttl_ms (u64): inactive RBC sessions are pruned after this TTL (milliseconds) to bound memory.
- sumeragi.advanced.rbc.rebroadcast_sessions_per_tick (usize): cap on RBC session rebroadcasts per tick to prevent payload storms when backlogs accumulate.
Metrics: RBC exports gauges/counters (sumeragi_rbc_sessions_active, sumeragi_rbc_sessions_pruned_total, sumeragi_rbc_ready_broadcasts_total, sumeragi_rbc_deliver_broadcasts_total, sumeragi_rbc_payload_bytes_delivered_total, sumeragi_rbc_rebroadcast_skipped_total{kind="payload|ready"}, sumeragi_rbc_mismatch_total{peer,kind}, sumeragi_rbc_persist_drops_total) and per-lane/dataspace backlog gauges (sumeragi_rbc_lane_{tx_count,total_chunks,pending_chunks,bytes_total}{lane_id} and sumeragi_rbc_dataspace_{tx_count,total_chunks,pending_chunks,bytes_total}{lane_id,dataspace_id}) alongside the Torii JSON endpoints shown above. The rebroadcast-skipped counters increment whenever the core skips payload/READY rebroadcasts.
Additional gauges track backlog pressure: sumeragi_rbc_backlog_chunks_total, sumeragi_rbc_backlog_chunks_max, and sumeragi_rbc_backlog_sessions_pending.
- Capture live snapshots. Start with iroha_cli --output-format text ops sumeragi telemetry (or GET /v1/sumeragi/telemetry) to inspect rbc_backlog and vote ingestion, then fetch /v1/sumeragi/rbc and /v1/sumeragi/rbc/sessions to list active payloads, chunk counts, and recovery flags.
- Inspect backlog counters. Watch sumeragi_rbc_backlog_chunks_total, sumeragi_rbc_backlog_chunks_max, and sumeragi_rbc_backlog_sessions_pending. Sustained non-zero values over five minutes (e.g., max_over_time(sumeragi_rbc_backlog_chunks_max[5m]) > 0) imply slow chunk delivery; correlate with ready_count vs delivered in the session snapshot.
- Check DA availability warnings. Alert on spikes in sumeragi_da_gate_block_total{reason="missing_local_data"}; sumeragi_rbc_da_reschedule_total is legacy and should remain zero in current pipelines.
- Evaluate pacemaker deferrals and proposal backpressure. Use increase(sumeragi_pacemaker_backpressure_deferrals_total[5m]), increase(sumeragi_pacemaker_backpressure_deferrals_by_reason_total{reason="..."}[5m]), max_over_time(sumeragi_pacemaker_backpressure_deferral_age_ms{reason="..."}[5m]), max_over_time(sumeragi_tx_queue_saturated[5m]), max_over_time(sumeragi_pending_blocks_blocking[5m]), max_over_time(sumeragi_commit_inflight_queue_depth[5m]), sumeragi_rbc_backlog_*, and relay drop/backpressure counters to confirm whether the pacemaker halted due to queue saturation, relay/RBC backlog, or blocking pending blocks. Combine with increase(gossip_fallback_total[5m]) and increase(block_created_proposal_mismatch_total[5m]) to surface collectors retrying without progress.
- Review logs and network health. Filter consensus logs for rbc and pacemaker_backpressure_deferral to spot repeated retries, DA restarts, or queue pressure. Cross-check P2P metrics (p2p_queue_depth{priority=...}, p2p_dropped_posts, p2p_dropped_broadcasts, p2p_subscriber_queue_full_total, p2p_subscriber_queue_full_by_topic_total{topic=...}, p2p_subscriber_unrouted_total, p2p_subscriber_unrouted_by_topic_total{topic=...}) and payload ingress drops (consensus_ingress_drop_total{topic="ConsensusPayload|ConsensusChunk|BlockSync",reason="rate|bytes|rbc_session_limit|penalty"}) to identify network bottlenecks; adjust collector fan-out, queue capacity, or baseline load accordingly.
- Correlate logs automatically. Run python3 scripts/sumeragi_backpressure_log_scraper.py <logfile> to list each pacemaker deferral together with nearby missing-availability entries. Adjust --window-before/--window-after to match your alert window and add --status path/to/status.json when you have a /v1/sumeragi/status snapshot handy. The script prints a human-readable summary by default and supports --json for feeding structured reports into on-call automation.
- Escalate persistent issues. If backlog/deferral metrics stay high beyond two blocks:
  - Freeze new client submissions via admission rate limiting.
  - Manually inspect problematic sessions with iroha_cli --output-format json ops sumeragi telemetry to confirm which height/view is stuck.
  - Consider increasing sumeragi.advanced.rbc.chunk_max_bytes or provisioning additional bandwidth before re-enabling full load.
Availability collectors expose vote ingestion counters: sumeragi_da_votes_ingested_total, sumeragi_da_votes_ingested_by_collector{collector_idx="..."}, and sumeragi_da_votes_ingested_by_peer{peer="..."}. Availability-evidence assembly latency is recorded via the histogram sumeragi_qc_assembly_latency_ms{kind="availability"} with the latest observed latency mirrored in sumeragi_qc_last_latency_ms{kind="availability"}.
Pacemaker configuration (Sumeragi)
- sumeragi.advanced.pacemaker.backoff_multiplier (u32): scales each timeout backoff step (default 1).
- sumeragi.advanced.pacemaker.rtt_floor_multiplier (u32): RTT floor multiplier; floor = avg_rtt * multiplier (default 2).
- sumeragi.advanced.pacemaker.max_backoff_ms (u64): backoff cap in milliseconds (default 60000).
- sumeragi.advanced.pacemaker.jitter_frac_permille (u32): jitter band in permille of the window (0..=1000, default 0 = off).
Notes
- All metrics have deterministic semantics across hardware. Parallel paths publish counters only after deterministic commit.
- Extend dashboards with Torii endpoint metrics (torii_*) once wired; see the roadmap for status.
Prometheus rules surfacing randomness degradation:
alert: SumeragiVrfNoParticipation
expr: increase(sumeragi_vrf_no_participation_total[140m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Validator skipped VRF commit and reveal windows"
description: |
Non-participation penalties incremented (count={{ $value }}). Inspect `iroha_cli ops sumeragi vrf-epoch --epoch <current>`
to identify the offline signer and stage reconfiguration or slashing if the validator cannot recover.
alert: SumeragiVrfNonReveal
expr: increase(sumeragi_vrf_non_reveal_penalties_total[140m]) > 0
labels:
severity: warning
annotations:
summary: "Validator missed VRF reveal deadline"
description: |
Non-reveal penalties incremented (count={{ $value }}). Use `sumeragi_vrf_non_reveal_by_signer` labels to page the validator and re-submit the reveal.
alert: SumeragiVrfPayloadReject
expr: increase(sumeragi_vrf_rejects_total_by_reason{reason!="late"}[5m]) > 0
labels:
severity: warning
annotations:
summary: "VRF payload rejected"
description: |
Instance {{ $labels.instance }} rejected a VRF payload for reason {{ $labels.reason }}. Confirm the validator is on the latest parameters and re-issue the payload.
alert: SumeragiVrfCommitStall
expr: rate(sumeragi_vrf_commits_emitted_total[5m]) == 0 and increase(sumeragi_blocks_committed_total[5m]) > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Epoch commitments stalled while blocks continue"
description: |
No VRF commitments observed during an active epoch. Verify validators are online and check `/v1/sumeragi/status.vrf_penalty_epoch`.
Adjust the 140 minute window to match your deployment’s sumeragi.npos.vrf.commit_deadline_offset_blocks + sumeragi.npos.vrf.reveal_deadline_offset_blocks (defaults assume one-second blocks). Link alerts to the response checklist in docs/source/sumeragi.md#vrf-alert-response-runbook.
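To pick a window for your deployment, convert the two deadline offsets from blocks to minutes. The offset values below are hypothetical; substitute your sumeragi.npos.vrf.* settings and block time:

```python
# Derive the VRF alert window (in minutes) from the commit and reveal
# deadline offsets, both expressed in blocks. Offset values are hypothetical.
def vrf_alert_window_minutes(commit_offset_blocks, reveal_offset_blocks,
                             block_time_secs=1.0):
    total_blocks = commit_offset_blocks + reveal_offset_blocks
    return total_blocks * block_time_secs / 60.0

# e.g. 4200 + 4200 blocks at one-second blocks gives the 140m window used above
print(vrf_alert_window_minutes(4200, 4200))  # 140.0
```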
The sample rule group lives at docs/source/references/prometheus.rules.sumeragi_vrf.yml; include it from your Prometheus configuration (see docs/source/references/prometheus.template.yml for an example rule_files entry).
Run scripts/check_prometheus_rules.sh to validate the rules locally. The helper invokes promtool check rules if Prometheus is installed, or falls back to docker run --rm prom/prometheus … when Docker is available.
Pre-auth connection gating exposes these metrics:
- torii_pre_auth_reject_total{reason} — counter of rejected connections. Reasons: global_cap, ip_cap, rate, ban.
- torii_operator_auth_total{action,result,reason} — operator auth events; action is gate|register_options|register_verify|login_options|login_verify, result is allowed|denied|rate_limited|locked, and reason mirrors the auth error labels.
- torii_operator_auth_lockout_total{action,reason} — operator auth lockouts per action and failure reason.
- torii_contract_throttled_total{endpoint} — contract API requests rejected by the deploy limiter (endpoint = code, deploy, activate).
- torii_contract_errors_total{endpoint} — contract API requests that failed for other reasons (missing token, queue error, etc.).
- torii_active_connections_total{scheme} — gauge tracking concurrent connections per scheme (http, ws).
Suggested alert for sustained rejections:
alert: ToriiPreAuthRejects
expr: increase(torii_pre_auth_reject_total[5m]) > 10
for: 10m
labels:
severity: warning
annotations:
summary: "Torii pre-auth gating is rejecting clients"
description: |
Instance {{ $labels.instance }} rejected {{ $value }} connections (reason={{ $labels.reason }}).
Inspect Torii pre-auth configuration and recent traffic patterns. Check allowlists for trusted operators.
Operational guidance:
- Track torii_active_connections_total during incidents to confirm capacity pressure.
- Use CIDR allowlists for monitoring systems that should bypass gating.
- When bans trigger repeatedly, inspect the offending IP, update policies, and clear bans by restarting Torii if needed.
Refer to sorafs/provider_advert_rollout.md for the R0–R3 enforcement
timeline, Grafana export (grafana_sorafs_admission.json), and the canonical
alert wiring that Observability keeps in sync across environments.
Norito-RPC transport telemetry requirements are captured in docs/source/torii/norito_rpc_telemetry.md. Key metrics:
- torii_request_duration_seconds_bucket{scheme} — scheme-level latency histogram; filter on scheme="norito_rpc" for burn-in dashboards.
- torii_request_failures_total{scheme,code} — error counter keyed by connection scheme and HTTP status.
- torii_http_requests_total{content_type,status,method} — request counter exposing content_type="application/x-norito" for Norito calls.
- torii_http_request_duration_seconds_bucket{content_type,method} — latency histogram for content-type parity dashboards.
- torii_http_response_bytes_total{content_type,method,status} — optional payload-size counter for regression detection.
- torii_norito_decode_failures_total{payload_kind,reason} — bucketed Norito RPC decode failures (invalid magic, checksum mismatch, unsupported feature, etc.).
- torii_address_invalid_total{surface,reason} — rejects grouped by Torii surface (e.g., routing.source, iso_bridge.source) and the stable address error code so SDK drift or malformed I105 literals are visible.
- Local‑8 specific counters are retired; rely on torii_address_invalid_total{surface,reason} and torii_address_collision_total{surface,kind} to monitor address ingestion and Local‑12 safety.
- Existing gauges (torii_active_connections_total{scheme}, torii_pre_auth_reject_total{reason}) must include scheme="norito_rpc" to track transport gating.
Alert rules live in dashboards/alerts/torii_norito_rpc_rules.yml (see companion test dashboards/alerts/tests/torii_norito_rpc_rules.test.yml). Highlights:
- ToriiNoritoRpcErrorSpike warns when Norito 5xx responses exceed allowable thresholds for five minutes.
- ToriiNoritoRpcLatencyDegraded fires when Norito P95 latency breaches 750 ms.
- ToriiNoritoRpcSilentTraffic detects 30 minutes without Norito requests, signalling misrouted traffic.
Use scripts/telemetry/test_torii_norito_rpc_alerts.sh to run promtool test rules locally or in CI so alert updates stay in sync with dashboards.
- Dashboards. dashboards/grafana/sorafs_fetch_observability.json visualises ingestion latency (histogram_quantile over sorafs_orchestrator_fetch_duration_ms / sorafs_orchestrator_chunk_latency_ms), retry/backoff counters, and transport errors so DA fetch regressions are visible during burn-in. dashboards/grafana/sorafs_gateway_observability.json focuses on replication/backlog metrics such as torii_sorafs_replication_backlog_total, torii_sorafs_replication_deadline_slack_epochs, PoR ingest queues (torii_sorafs_por_ingest_backlog, torii_sorafs_por_ingest_failures_total), and gateway refusal/throttle counters. dashboards/grafana/sorafs_capacity_penalties.json tracks retention (torii_sorafs_storage_bytes_{used,capacity}), accumulated GiB·hours, replication SLA outcomes (torii_sorafs_replication_sla_total), disputes/slash proposals, and fee projections so finance/storage stakeholders review the same evidence bundle. dashboards/grafana/sorafs_capacity_health.json graphs declared/effective/utilised GiB, outstanding reservations, backlog depth, and expired-order rates per provider to highlight saturation before SLO drift.
- Chunking cost. torii_da_chunking_seconds tracks DA chunking + erasure coding CPU time; monitor p95 via histogram_quantile(0.95, sum(rate(torii_da_chunking_seconds_bucket[5m])) by (le)).
- Proof health. PDP/PoTR pass rates surface through torii_sorafs_proof_stream_events_total{kind="pdp|potr",result}; latency distributions are published via torii_sorafs_proof_stream_latency_ms. Pair them with torii_sorafs_storage_por_samples_{success,failed}_total to prove challenge coverage and link panels directly to provider alerts.
- Retention digest. docs/source/status/sorafs_da_weekly_digest.md is the weekly template mandated by DA-9. Populate it with snapshots exported from the dashboards above plus the PromQL snippets used to compute p95 ingestion latency and backlog deltas; attach the Markdown to the Friday digest mail that SRE sends to Storage + Product.
- Alerting. Import dashboards/alerts/sorafs_fetch_rules.yml, dashboards/alerts/sorafs_gateway_rules.yml, and dashboards/alerts/sorafs_capacity_rules.yml so ingestion, gateway, and capacity saturation signals stay covered. Validate each pack with its fixture (dashboards/alerts/tests/sorafs_fetch_rules.test.yml, dashboards/alerts/tests/sorafs_capacity_rules.test.yml) via promtool test rules … (or scripts/check_prometheus_rules.sh dashboards/alerts/sorafs_*) before publishing changes.
- Runbook integration. Incident tickets tied to DA need three artefacts: (1) Grafana JSON exports (sorafs_fetch_observability and sorafs_gateway_observability), (2) the filled digest template, and (3) alert UUIDs from the rule groups above. This satisfies the DA-9 roadmap requirement that SRE/operator reviews include dashboards plus a narrative digest.
Roadmap item M2.2 — Gas & Telemetry now exposes per-asset gauges and eviction counters for the confidential commitment tree so operators can prove tree depth, root-history hygiene, and checkpoint behaviour:
- `iroha_confidential_tree_commitments{asset_id}` — current number of commitments (leaves).
- `iroha_confidential_tree_depth{asset_id}` — Merkle depth in levels (bounded by the verifier profile).
- `iroha_confidential_root_history_entries{asset_id}` — retained historical roots (should match `zk.root_history_cap`).
- `iroha_confidential_frontier_checkpoints{asset_id}` — number of recorded frontier checkpoints.
- `iroha_confidential_frontier_last_checkpoint_height{asset_id}` — height of the last frontier checkpoint (0 when no checkpoint exists).
- `iroha_confidential_frontier_last_checkpoint_commitments{asset_id}` — commitment count captured alongside the last checkpoint.
- `iroha_confidential_root_evictions_total{asset_id}` — counter of root-history evictions after enforcing the cap.
- `iroha_confidential_frontier_evictions_total{asset_id}` — counter of frontier checkpoint evictions when the interval/depth window trims history.
Grafana panel `dashboards/grafana/confidential_assets.json` graphs the gauges
and pairs the eviction counters with alert rules that fire when roots or
checkpoints churn faster than expected. Capture evidence with a simple scrape:
```shell
curl -s http://127.0.0.1:8180/metrics \
  | rg 'iroha_confidential_(tree_(commitments|depth)|root_history_entries|frontier_(checkpoints|last_checkpoint_height|last_checkpoint_commitments)){asset_id="xor#wonderland"}'
```

Pair this with `rg 'iroha_confidential_(root|frontier)_evictions_total'` to prove
the eviction counters advanced after a maintenance window. The calibration doc
(docs/source/confidential_assets_calibration.md) records the signed baselines
and links to the corresponding governance acknowledgement.
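That scrape can be turned into an automated check. A hedged sketch that parses Prometheus exposition text and compares `iroha_confidential_root_history_entries` against the configured cap; the sample lines and the cap value below are illustrative, not real output:

```python
import re

# Illustrative exposition-format lines, not a real scrape.
SAMPLE_SCRAPE = """\
iroha_confidential_tree_commitments{asset_id="xor#wonderland"} 1024
iroha_confidential_root_history_entries{asset_id="xor#wonderland"} 64
"""

def gauge(text, name, asset_id):
    """Pull one per-asset gauge value out of Prometheus exposition text."""
    pattern = name + r'\{asset_id="' + re.escape(asset_id) + r'"\} (\S+)'
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None

root_history_cap = 64  # should mirror zk.root_history_cap in your config
entries = gauge(SAMPLE_SCRAPE, "iroha_confidential_root_history_entries", "xor#wonderland")
within_cap = entries is not None and entries <= root_history_cap
```

A real check would feed the output of the `curl` command above into `gauge` and alert when `within_cap` is false.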
The same dashboard now surfaces the Halo2 verifier cache counters exposed via
`iroha_zk_verifier_cache_events_total{cache,event}` (`cache` = `vk` |
`builtin`, `event` = `hit` | `miss`). Use these counters to compute the 5-minute
miss ratio:
```shell
curl -s http://127.0.0.1:8180/metrics \
  | rg 'iroha_zk_verifier_cache_events_total{cache="vk",event="(hit|miss)"}'
```

`dashboards/alerts/confidential_assets_rules.yml` ships the
`ConfidentialVerifierCacheMissSpike` warning, which fires whenever misses exceed 40%
of lookups for ten minutes, ensuring cache regressions are caught alongside the
tree-depth guard.
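The 40 % threshold is easy to sanity-check offline. A sketch, using invented counter deltas, of the ratio the alert rule evaluates over its window:

```python
def miss_ratio(hit_delta, miss_delta):
    """Fraction of cache lookups that missed over a window (0.0 when idle)."""
    lookups = hit_delta + miss_delta
    return miss_delta / lookups if lookups else 0.0

# Hypothetical increase() of iroha_zk_verifier_cache_events_total over 5m.
ratio = miss_ratio(hit_delta=120, miss_delta=80)  # 0.4, exactly at the edge
would_alert = ratio > 0.40                        # the rule fires strictly above 40 %
```

Guarding against the zero-lookup case matters: an idle verifier should read as a 0 % miss ratio, not a division error or a spurious alert.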
Use this checklist when the Norito transport fails SLOs or generates alerts:
- Primary signals: `ToriiNoritoRpcErrorSpike`, `ToriiNoritoRpcLatencyDegraded`, `ToriiNoritoRpcSilentTraffic`, `torii_norito_decode_failures_total`, and the `torii_request_duration_seconds_bucket{scheme="norito_rpc"}` panels in `dashboards/grafana/torii_norito_rpc_observability.json`.
- Client corroboration: SDK CI and mock-harness jobs export `torii_mock_harness_retry_total`, `torii_mock_harness_duration_ms`, and `torii_mock_harness_fixture_version`; spikes in these client-side metrics usually indicate regressions before production traffic is impacted.
- Support tooling: `python/iroha_python/scripts/run_norito_rpc_smoke.sh` validates end-to-end request/response flows, and `scripts/telemetry/test_torii_norito_rpc_alerts.sh` confirms alert expressions continue to match the on-disk rules after remediation.
Immediate triage
- Determine the failure mode:
  - Error spike: `torii_request_failures_total{scheme="norito_rpc"}` or decode counters rise.
  - Latency breach: `histogram_quantile(0.95, torii_request_duration_seconds_bucket{scheme="norito_rpc"})` exceeds 750 ms.
  - Silent traffic: `torii_request_duration_seconds_count{scheme="norito_rpc"}` flat-lines while HTTP/JSON remain healthy.
- Scope the impact with dashboards, then inspect Torii logs filtered on `ConnScheme::NoritoRpc` for `schema_mismatch`, checksum, or TLS errors. Decode failures emit the exact `reason` tag shown in telemetry.
- Verify edge ingress still forwards the binary payloads by running `curl -H 'Content-Type: application/x-norito' https://<torii-host>/rpc/ping` (or the mock harness) from multiple regions. A JSON-looking response indicates a proxy stripped the header.
- Compare server metrics with SDK mock-harness counters. If retries spike in CI but production traffic is normal, halt the client rollout and coordinate with SDK owners instead of throttling Norito globally.
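As a rough triage aid, the three failure modes above can be told apart mechanically from metric deltas. A hypothetical sketch (the function and argument names are illustrative; thresholds mirror the runbook values):

```python
def classify(failure_delta, p95_seconds, request_count_delta):
    """Map Norito RPC metric deltas onto the runbook's failure modes.

    failure_delta       -- increase() of torii_request_failures_total{scheme="norito_rpc"}
    p95_seconds         -- histogram_quantile(0.95, ...) over the same window
    request_count_delta -- increase() of torii_request_duration_seconds_count
    """
    if failure_delta > 0:
        return "error_spike"
    if p95_seconds > 0.750:  # runbook breach threshold: 750 ms
        return "latency_breach"
    if request_count_delta == 0:
        return "silent_traffic"
    return "healthy"

mode = classify(failure_delta=0, p95_seconds=0.9, request_count_delta=42)
```

The ordering matters: an error spike should win over a latency breach so remediation starts with decode/schema failures rather than performance tuning.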
Mitigation options
- Misbehaving clients: Gate them via `torii.preauth_scheme_limits.norito_rpc` until parity is restored; the config lives under `client_api` in `iroha_config`.
- Decode or schema mismatches: Ensure Torii and SDKs run the same fixture bundle by checking `fixtures/norito_rpc/schema_hashes.json` (the DTO→hash table); regenerate fixtures (`cargo xtask norito-rpc-fixtures --all`) if hashes diverge.
- Ingress/proxy issues: Fix header forwarding or MTU settings, then rerun `python/iroha_python/scripts/run_norito_rpc_smoke.sh`.
- Full brownout / rollback: If service impact persists, flip `torii.transport.norito_rpc.stage` to `canary` or `disabled` (per `docs/source/torii/norito_rpc_rollout_plan.md`), reload Torii, and ensure `/rpc/capabilities` reports the downgraded stage so SDKs fall back to JSON without guessing. Record the change in the canary runbook (`docs/source/runbooks/torii_norito_rpc_canary.md`).
Verification
- Watch `torii_request_failures_total` and the decode counter return to baseline; clear Alertmanager silences only after the metrics stay flat for one evaluation period.
- Confirm `torii_active_connections_total{scheme="norito_rpc"}` stabilises and the `ToriiNoritoRpcSilentTraffic` alert stays green.
- Re-run the Norito RPC smoke test (`python/iroha_python/scripts/run_norito_rpc_smoke.sh`) and alert tests (`scripts/telemetry/test_torii_norito_rpc_alerts.sh`).
- Capture evidence (Grafana PNGs, config patches, CLI outputs) and attach it to the NRPC-2 runbook ticket plus `status.md` so the roadmap artifact remains auditable.
A new Prometheus counter `sumeragi_membership_mismatch_total{peer,height,view}` and gauge `sumeragi_membership_mismatch_active{peer}` were introduced to detect validator roster divergence. `/v1/sumeragi/status` now surfaces a `membership_mismatch` block with the active peer list and last mismatch context to speed triage.
The gauges `sumeragi_membership_view_hash`, `sumeragi_membership_height`, `sumeragi_membership_view`, and `sumeragi_membership_epoch` expose the deterministic membership hash together with the `(height, view, epoch)` context. Compare these values across peers to confirm roster alignment without waiting for mismatch alarms.
Recommended alert (recorded in the runbook):
```yaml
alert: SumeragiMembershipMismatch
expr: increase(sumeragi_membership_mismatch_total[5m]) > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Consensus membership mismatch detected"
  description: |
    Node {{ $labels.instance }} observed validator membership mismatch for peer {{ $labels.peer }} at height {{ $labels.height }} view {{ $labels.view }}.
    Investigate peer configuration, on-chain `SumeragiParameters`, and recent key rotation events.
```
Operations checklist:
- Verify the mismatch is expected (e.g., a pending topology change) via `/v1/sumeragi/status`.
- If unexpected, quarantine the offending peer and confirm configuration files match the on-chain roster.
- After remediation, ensure `sumeragi_membership_mismatch_active{peer}` returns to `0`.
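Comparing the membership gauges across peers can be scripted ahead of any mismatch alarm. A sketch that flags roster divergence from per-peer `(height, view, hash)` samples; the sample values are hypothetical, and in practice you would collect them from each peer's `/v1/sumeragi/status`:

```python
from collections import defaultdict

def find_divergence(samples):
    """samples: iterable of (peer, height, view, membership_hash).

    Returns {(height, view): {hash: [peers]}} for contexts where peers
    disagree on the deterministic membership hash."""
    grouped = defaultdict(lambda: defaultdict(list))
    for peer, height, view, mhash in samples:
        grouped[(height, view)][mhash].append(peer)
    return {hv: dict(hashes) for hv, hashes in grouped.items() if len(hashes) > 1}

# Hypothetical per-peer readings of sumeragi_membership_view_hash.
samples = [
    ("peer-a", 100, 2, "0xabc"),
    ("peer-b", 100, 2, "0xabc"),
    ("peer-c", 100, 2, "0xdef"),  # diverging roster
]
diverged = find_divergence(samples)
```

Grouping by `(height, view)` avoids false positives when peers are simply at different heights while catching genuine same-context disagreements.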
The Nexus scheduler exports TEU-focused metrics once the lane router lands. All
series are exposed behind the metrics capability and default to 0 in
single-lane mode so dashboards can be provisioned early.
Key metrics:
- `nexus_scheduler_lane_teu_capacity{lane}` (gauge) — per-lane TEU cap.
- `nexus_scheduler_lane_teu_slot_committed{lane}` (gauge) — TEU used during the most recent slot.
- `nexus_scheduler_lane_teu_slot_breakdown{lane,bucket}` (gauge) — stacked view of floor/headroom/must-serve/circuit-breaker consumption.
- `nexus_scheduler_lane_teu_deferral_total{lane,reason}` (counter) — TEU deferred because the lane hit a cap, quota, or envelope limit.
- `nexus_scheduler_dataspace_teu_backlog{lane,dataspace}` (gauge) — queued TEU demand remaining after scheduling.
- `nexus_scheduler_dataspace_age_slots{lane,dataspace}` (gauge) — slots since the dataspace was last served.
- `nexus_scheduler_starvation_bound_slots{lane}` (gauge) — configured starvation bound applied to the lane.
- `nexus_scheduler_must_serve_truncations_total{lane}` (counter) — truncated must-serve slices.
- `nexus_scheduler_lane_trigger_level{lane}` (gauge) — current circuit-breaker tier.
The `/status` endpoint mirrors these gauges and now ships richer per-lane
pipeline summaries via `nexus_scheduler_lane_teu_status` responses. Each lane
snapshot includes the scheduler graph counters (`tx_vertices`, `tx_edges`,
`overlay_count`, `overlay_instr_total`, `overlay_bytes_total`, `rbc_chunks`,
`rbc_bytes_total`) plus:
- `peak_layer_width`, `layer_count` — outer bounds of the scheduler layering that executed for the lane in the latest block.
- `avg_layer_width`, `median_layer_width`, `scheduler_utilization_pct` — basic moments for histogram dashboards.
- `layer_width_buckets[le]` — monotonically increasing buckets (`le` = 1, 2, 4, 8, 16, 32, 64, 128) matching the Prometheus histogram shown on Grafana.
- `manifest_required`, `manifest_ready` — whether the lane requires a manifest and if one is loaded.
- `manifest_path` — best-effort path of the active manifest (surfaced for operators).
- `manifest_validators`, `quorum` — validator roster and quorum declared by the manifest.
- `protected_namespaces` — namespaces gated by the lane’s governance policy.
- `runtime_upgrade` — snapshot of the runtime-upgrade hook (`allow`, `require_metadata`, `metadata_key`, `allowed_ids`).
- `dataspace_id`, `dataspace_alias` — identify which dataspace the lane services, matching the scheduler backlog gauges.
- `storage_profile` — lane storage strategy (`full_replica`, `commitment_only`, `split_replica`).
- `detached_prepared`, `detached_merged`, `detached_fallback` — detached overlay execution counters sized to the lane.
- `quarantine_executed` — quarantine fallbacks the lane had to drain during the block.
The sister collection `nexus_scheduler_dataspace_teu_status` mirrors the
per-dataspace backlog view and reports `tx_served` plus the dataspace
`fault_tolerance` (`f`) so operators can see the configured lane-relay committee
sizing (`3f+1`) in the same status payload.
The transaction queue now keeps these snapshots warm in between block commits.
Routing rules can specify dataspace aliases, so `ConfigLaneRouter` resolves both
lane and dataspace IDs before the queue updates telemetry.
Every time a transaction is enqueued or drained from consensus, the queue
recomputes pending TEU per lane/dataspace and updates the gauges. Operators can
therefore watch `nexus_scheduler_dataspace_teu_backlog` to understand backlog
pressure even before the next slot envelope is assembled. Lane headroom now
reflects remaining capacity (`capacity - min(pending_teu, capacity)`) instead of
mirroring backlog directly, so `nexus.scheduler.headroom` only fires when a lane
actually consumes most of its configured TEU budget rather than whenever the
queue is non-empty.
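The headroom formula quoted above is straightforward to mirror in operator tooling. A sketch under the stated semantics, showing why a merely non-empty queue no longer depresses headroom to zero:

```python
def lane_headroom(capacity, pending_teu):
    """Remaining TEU headroom: capacity - min(pending_teu, capacity).

    Clamping means headroom never goes negative, even when backlog
    exceeds the configured TEU budget."""
    return capacity - min(pending_teu, capacity)

empty = lane_headroom(capacity=1000, pending_teu=0)     # full headroom, no alert
busy = lane_headroom(capacity=1000, pending_teu=950)    # nearly exhausted
over = lane_headroom(capacity=1000, pending_teu=4000)   # clamped at zero
```

With the 15 % threshold described later in this page, `busy` (5 % headroom) would emit `nexus.scheduler.headroom` while `empty` would not, regardless of queue length.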
- `nexus_config_diff_total{knob,profile}` (counter) — increments whenever the active Nexus configuration diverges from the single-lane baseline. The `knob` label identifies the section that changed (for example `nexus.lane_catalog.count`, `nexus.routing.rules`, `nexus.da`), and `profile` is set to `active`.
Alert snippet:
`increase(nexus_config_diff_total{profile="active"}[5m]) > 0`
Page outside planned maintenance windows. Each increment also emits a telemetry log entry `nexus.config.diff` with a Norito JSON payload listing baseline and current values — review it during config rollouts and the `TRACE-TELEMETRY-BRIDGE` dry-run to confirm the expected knobs moved.
- `nexus_lane_configured_total` (gauge) — reports how many Nexus lane catalog entries the node has applied. Compare it against the expected lane count (for example, `1` for single-lane bundles or `3` for Nexus multi-lane deployments) to catch misconfigured peers ahead of routed-trace audits. When `nexus.enabled=false`, lane/dataspace metrics (including this gauge) are reset and filtered out of `/metrics` and `/status` so Iroha 2 deployments stay lane-free.
Alert snippet:
`nexus_lane_configured_total != EXPECTED_LANE_COUNT`
Example PromQL snippets:
- Lane headroom: `nexus_scheduler_lane_teu_capacity - nexus_scheduler_lane_teu_slot_committed`.
- Starvation bound: `max_over_time(nexus_scheduler_dataspace_age_slots[5m])` vs `max(nexus_scheduler_starvation_bound_slots)`.
- Deferrals per reason: `rate(nexus_scheduler_lane_teu_deferral_total{reason="cap_exceeded"}[10m])`.
- Truncated must-serve: `increase(nexus_scheduler_must_serve_truncations_total[30m])`.
Dashboard:
- Import `docs/source/grafana_scheduler_teu.json`. Panels include lane capacity stats, stacked bucket timeseries, backlog heatmap, starvation bound trend, and trigger status table. Set `$PROM_DS` to your Prometheus data source after import.
Alert thresholds:
- Cap exhaustion: trigger when `rate(nexus_scheduler_lane_teu_deferral_total{reason="cap_exceeded"}[5m]) > 0` for `15m`.
- Starvation bound: trigger when `max(nexus_scheduler_dataspace_age_slots) >= max(nexus_scheduler_starvation_bound_slots)` for `5m`.
- Must-serve truncation: ticket when `increase(nexus_scheduler_must_serve_truncations_total[1h]) > 0`.
Operator triage:
- Run `iroha_cli app nexus lane-report --lane <id>` to inspect the lane cap, configured bound, and backlog snapshot (CLI update tracked under Nexus router workstreams).
- Check `nexus_scheduler_lane_trigger_level` (tier `>0` implies a circuit-breaker is reducing caps); reference `docs/source/nexus_transition_notes.md` for trigger semantics.
- Inspect Torii logs (`pipeline::scheduler`) for per-slot summaries including bucket breakdowns and queue diffs.
- If starvation continues, verify `routing_policy` and dataspace quotas in configuration; confirm peers share identical catalog hashes.
- Telemetry log `nexus.lane.topology` — emitted whenever a lane is provisioned, retired, or renamed. Each Norito JSON payload includes:
  - `action`: `provisioned`, `retired`, or `alias_migrated`.
  - `lane_id`, `alias`, `slug`, `dataspace_id`, `visibility`, and `storage_profile` for new lanes.
  - For migrations: `alias_before`, `alias_after`, `slug_before`, `slug_after`.
Operator workflow:
- Tail `journalctl -u irohad -o json | jq 'select(.msg=="nexus.lane.topology")'` (or subscribe via the OTLP bridge) whenever the governance catalog changes so you capture an immutable record of the storage changes.
- Store the JSON payloads with the change request; auditors cross-check them with `nexus_lane_configured_total` and Kura’s directory layout.
- Add alert rules when unexpected events arrive (for example, `action="retired"` outside a maintenance window). A simple Loki/Grafana expression is `count_over_time({job="irohad"} |= "nexus.lane.topology" | json | action="retired" [5m]) > 0`.
- Feed the events into automation such as `scripts/nexus_lane_smoke.py --from-telemetry telemetry.ndjson --require-alias-migration alpha:payments` so the smoke tests can assert that relabeling occurred after a rename. The helper rejects telemetry logs that do not contain the expected alias entries, making it safe to gate CI on the recorded `nexus.lane.topology` payloads.
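The `journalctl`/`jq` filter above can equally be run over an NDJSON telemetry capture. A sketch of such a filter; the sample events are invented for illustration and only the `msg`/`action` field names follow the payload description above:

```python
import json

# Invented NDJSON telemetry capture for illustration.
SAMPLE_NDJSON = "\n".join([
    json.dumps({"msg": "nexus.lane.topology", "action": "provisioned", "lane_id": 7}),
    json.dumps({"msg": "other.event"}),
    json.dumps({"msg": "nexus.lane.topology", "action": "retired", "lane_id": 3}),
])

def topology_events(ndjson_text, action=None):
    """Yield nexus.lane.topology payloads, optionally filtered by action."""
    for line in ndjson_text.splitlines():
        event = json.loads(line)
        if event.get("msg") != "nexus.lane.topology":
            continue
        if action is None or event.get("action") == action:
            yield event

retired = list(topology_events(SAMPLE_NDJSON, action="retired"))
```

A filter like this can back a CI assertion that a maintenance window produced exactly the expected `retired`/`alias_migrated` events and nothing else.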
- When per-lane headroom drops below 15 % of its configured capacity, hits zero, or `nexus_scheduler_lane_trigger_level > 0`, Iroha emits the structured log `nexus.scheduler.headroom`. Each Norito payload records `lane_id`, `capacity`, `committed`, `headroom_teu`, `headroom_pct`, the bucket breakdown, `trigger_level`, `starvation_bound_slots`, and cumulative deferrals. Tail the log during drills with `journalctl -u irohad -o json | jq 'select(.msg=="nexus.scheduler.headroom")'`.
- Prometheus exposes `nexus_scheduler_lane_headroom_events_total{lane_id}` so you can alert when the log fires more than a handful of times per hour outside rehearsals. Combine it with `nexus_scheduler_lane_teu_slot_breakdown` to tune TEU alerts.
- After uploading the telemetry pack, run `scripts/telemetry/validate_nexus_telemetry_pack.py --pack-dir <dir> --workload-seed NEXUS-REH-2026Q1 --slot-range 820-860` to generate `telemetry_manifest.json` + `.sha256` for the evidence bucket and tracker.
The OTLP exporter used during rehearsals throttled when its batch queue capped
out. Configure your collector (for example `otelcol-contrib`) with a 256-sample
batch to keep `otlp.ndjson` captures within the rehearsal window:

```yaml
exporters:
  otlphttp:
    endpoint: https://telemetry.example.net/otlp
processors:
  batch:
    timeout: 5s
    send_batch_size: 256
    send_batch_max_size: 256
extensions:
  health_check: {}
service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

NX-18 introduces a dedicated metric set for the 1 s finality gate. The gauges
and histograms ship in every build (single-lane deployments simply emit steady
0/baseline values) so operators can configure dashboards and alerts ahead of
the Nexus cut-over. All metrics back the Grafana board stored in
`dashboards/grafana/nexus_lanes.json` and the runbook documented in
`docs/source/runbooks/nexus_lane_finality.md`.
| Metric | Description | SLO / Action |
|---|---|---|
| `histogram_quantile(0.95, iroha_slot_duration_ms)` | Slot-duration histogram derived from the end of every Sumeragi slot. | Keep p95 ≤ 1 000 ms (warning at 950 ms). Breaches must trigger the slot runbook and be recorded in the NX-18 drill log. |
| `iroha_slot_duration_ms_latest` | Gauge of the most recent slot duration. | Capture alongside the histogram whenever filing incidents; sustained spikes > 1 100 ms indicate an unhealthy validator even if quantiles remain green. |
| `iroha_da_quorum_ratio` | Rolling fraction of slots that satisfied the DA quorum window. | Target ≥ 0.95; combine with `increase(sumeragi_da_gate_block_total{reason="missing_local_data"}[5m])` to locate failing attesters or timeouts. |
| `sumeragi_rbc_da_reschedule_total` | Legacy counter for DA-driven slot reschedules (no longer incremented). | Keep at zero; investigate missing-availability counters instead (see `ops/runbooks/da-quorum.md`). |
| `iroha_oracle_price_local_per_xor` | Latest TWAP reported by the lane-specific oracle. | Watch for spikes when swap lines are thin; tie haircuts to treasury reports. |
| `iroha_oracle_staleness_seconds` | Seconds since the last oracle refresh. | Alert at ≥ 75 s; fail the NX-18 gate at ≥ 90 s until the feed is restarted. |
| `iroha_oracle_twap_window_seconds` | Effective TWAP window length. | Should remain at 60 s ± 5 s; deviations mean the oracle config drifted. |
| `iroha_oracle_haircut_basis_points` | Applied haircut per lane/dataspace. | Compare against the liquidity tier table before approving router changes. |
| `iroha_settlement_buffer_xor` / `iroha_settlement_buffer_capacity_xor` | Remaining buffer headroom and configured capacity. | Soft alert at 25 %, hard alert at 10 %; below the hard threshold force XOR-only routing and log the incident. |
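The soft/hard settlement-buffer thresholds from the table can be encoded directly in tooling. A sketch of the classification, assuming (as the table states) a 25 % soft and 10 % hard threshold on the buffer-to-capacity ratio:

```python
def buffer_alert_level(buffer_xor, capacity_xor):
    """Classify settlement-buffer headroom against the NX-18 thresholds."""
    ratio = buffer_xor / capacity_xor if capacity_xor else 0.0
    if ratio < 0.10:
        return "hard"  # force XOR-only routing and log the incident
    if ratio < 0.25:
        return "soft"
    return "ok"

level = buffer_alert_level(buffer_xor=180.0, capacity_xor=1000.0)  # 18 % headroom
```

Whether the boundary values themselves (exactly 25 % or 10 %) count as breaches is a policy choice; this sketch treats them as healthy and fires only strictly below each threshold.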
- `scripts/telemetry/check_slot_duration.py --json-out artifacts/nx18/slot_summary.json` parses Prometheus dumps and enforces the p95/p99 gates. CI wires this via `ci/check_nexus_lane_smoke.sh` so every release candidate ships the JSON summary next to the metrics snapshot (`fixtures/nexus/lanes/metrics_ready.prom` provides the sample pack for local validation).
- `scripts/telemetry/bundle_slot_artifacts.py --metrics <metrics.prom> --summary artifacts/nx18/slot_summary.json --out-dir artifacts/nx18` emits `slot_bundle_manifest.json` and SHA-256 digests for the required artefacts. `scripts/run_release_pipeline.py` invokes it automatically (skip with `--skip-nexus-lane-smoke`) so NX-18 sign-offs include immutable evidence.
- The chaos/acceptance harness now runs `scripts/telemetry/nx18_acceptance.py --json-out artifacts/nx18/nx18_acceptance.json <metrics.prom>` inside `ci/check_nexus_lane_smoke.sh` to gate DA quorum, oracle staleness/TWAP/haircuts, settlement buffers, and slot quantiles in one place. Keep the thresholds in the script aligned with the dashboards/alert rules.
- Capture routed-trace telemetry with `scripts/telemetry/check_nexus_audit_outcome.py` and archive the resulting JSON under `docs/examples/nexus_audit_outcomes/`. The tool enforces that every rehearsal produced a `nexus.audit.outcome` event and keeps the `TRACE-TELEMETRY-BRIDGE` checkpoints auditable.
- Record all drills with `scripts/telemetry/log_sorafs_drill.sh --log ops/drill-log.md --program NX-18 --status <status>` so quarter-end reports can enumerate slot, DA, oracle, and buffer rehearsals.
- Dashboards: import `dashboards/grafana/nexus_lanes.json` and keep it in sync with production panels, including quantile/ratio thresholds and Alertmanager annotations.
- Runbooks: use `docs/source/runbooks/nexus_lane_finality.md` for finality/oracle procedures and `ops/runbooks/da-quorum.md` for DA-specific mitigations. Both documents call out the metrics above plus evidence requirements.
- Release gates: ensure `status.md` entries link to the latest NX-18 slot bundle and Grafana captures whenever preparing a release candidate; the roadmap requires signed artefacts before multi-lane code paths are enabled by default.
Environment variables provide the same knobs (`OTEL_BLRP_MAX_EXPORT_BATCH_SIZE`
and `OTEL_METRIC_EXPORT_INTERVAL`). Set them to `256` and `5s` respectively
during Nexus rehearsals, then archive the resulting otlp.ndjson inside the
telemetry pack before running the validation script above.
- Telemetry log `nexus.audit.outcome` — emitted via `Telemetry::record_audit_outcome` whenever a routed-trace checkpoint completes. The Norito payload includes `trace_id`, `slot_height`, `reviewer`, `status` (for example `pass`, `fail`, `mitigated`), and an optional `mitigation_url`.
- Prometheus surfaces `nexus_audit_outcome_total{trace_id,status}` and `nexus_audit_outcome_last_timestamp_seconds{trace_id}` so dashboards and alert rules can track routed-trace health.
Operator workflow:
- During `TRACE-*` rehearsals, tail the telemetry stream (`journalctl -u irohad -o json` or the OTLP bridge) and confirm an event appears for each scheduled audit window.
- Archive the JSON payload alongside the audit artefacts so reviewers can trace the verdict and mitigation link.
- Run `scripts/telemetry/check_nexus_audit_outcome.py` against the telemetry log (for example, `--trace-id TRACE-TELEMETRY-BRIDGE --window-start <ISO time> --window-minutes 30`). The helper enforces that a matching payload exists, fails the run if a disallowed status (default `fail`) is observed, and stores the JSON artefact under `docs/examples/nexus_audit_outcomes/` for audit records.
- Alert when `status="fail"` or if no event is observed within 30 minutes of the expected audit slot; the Prometheus rule `dashboards/alerts/nexus_audit_rules.yml` fires on failing statuses, while CI should integrate the script above to gate the “missing outcome” requirement.
Prometheus metrics exposed by the Norito streaming runtime:
- `streaming_encode_latency_ms`: histogram of publisher encode latency.
- `streaming_encode_audio_jitter_ms`: EWMA audio jitter observed at the publisher (ms).
- `streaming_encode_audio_max_jitter_ms`: gauge tracking the peak audio jitter observed (ms).
- `streaming_encode_dropped_layers_total`: counter of rendition layers dropped during encode.
- `streaming_decode_buffer_ms`: viewer buffer depth histogram.
- `streaming_decode_dropped_frames_total`: counter for decoder frame drops.
- `streaming_decode_max_queue_ms`: histogram of max decode queue depth.
- `streaming_decode_av_drift_ms`: histogram of absolute audio/video drift measured at viewers.
- `streaming_decode_max_drift_ms`: gauge for maximum audio/video drift observed (ms).
- `streaming_audio_jitter_ms`, `streaming_audio_max_jitter_ms`: viewer-reported audio jitter histogram + peak gauge derived from `ReceiverReport` diagnostics.
- `streaming_av_drift_ms`, `streaming_av_max_drift_ms`: viewer-reported audio/video drift histogram + peak gauge (absolute milliseconds).
- `streaming_av_drift_ewma_ms`: signed gauge tracking the viewer EWMA drift used for throttling decisions.
- `streaming_av_sync_window_ms`: gauge exposing the active aggregation window (ms) advertised by viewers.
- `streaming_av_sync_violation_total`: counter incremented when viewers flag segments beyond the ±10 ms sync budget.
- `streaming_network_rtt_ms`, `streaming_network_loss_percent_x100`, `streaming_network_fec_{repairs,failures}_total`, `streaming_network_datagram_reinjects_total`: network health metrics.
- `streaming_energy_encoder_mw`, `streaming_energy_decoder_mw`: power usage gauges from publishers/viewers.
Telemetry payloads emitted over Norito (`TelemetryEncodeStats`, `TelemetryDecodeStats`, and the new `SyncDiagnostics` bundle carried inside `ReceiverReport`) now include audio jitter and drift fields so dashboards can surface out-of-sync segments. See `docs/source/project_tracker/nsc28b_av_sync_telemetry.md` for the enforcement rollout plan.
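The ±10 ms budget behind `streaming_av_sync_violation_total` can be checked per segment when replaying viewer diagnostics. A sketch with invented drift samples (only the budget value comes from the metric description above):

```python
SYNC_BUDGET_MS = 10.0  # ±10 ms A/V sync budget from the metric list above

def sync_violations(drift_samples_ms):
    """Count segments whose absolute audio/video drift exceeds the sync budget."""
    return sum(1 for drift in drift_samples_ms if abs(drift) > SYNC_BUDGET_MS)

# Hypothetical signed drift readings (ms) for five segments.
violations = sync_violations([-3.0, 9.5, 12.2, -15.0, 4.1])
```

Taking the absolute value matters because drift is signed (audio leading vs lagging video); both directions count against the budget.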