Operations
Operational readiness means that the network can be observed, changed, backed up, and recovered without relying on improvised access to validator hosts.
Observability
- Enable telemetry profiles intentionally. Use `extended` when `/metrics` is needed and `full` during test runs that need detailed Sumeragi operator routes.
- Dashboard accepted throughput, rejected throughput, commit latency, queue depth, queue saturation, view changes, dropped consensus messages, and storage pressure.
- Keep status snapshots, metrics scrapes, logs, and deployment configuration in the same incident or benchmark artifact set.
- Alert on sustained queue growth, unexpected rejection spikes, stalled block height, view-change churn, and peer health changes.
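
Two of the alert conditions above, stalled block height and sustained queue growth, can be expressed as small checks over a series of scrapes. The following is a minimal sketch, not a production alerting rule: the `Sample` type, its field names, and the thresholds are assumptions made for illustration and do not correspond to any specific metric names exposed by the peer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One scrape of peer state; field names are illustrative assumptions."""
    timestamp_s: float  # time of the scrape, in seconds
    block_height: int   # latest committed block height at that time
    queue_depth: int    # transactions waiting in the queue at that time


def height_stalled(samples: List[Sample], max_stall_s: float) -> bool:
    """Return True if block height has not advanced for at least max_stall_s."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Walk backwards over the trailing run of samples that share the
    # latest height, remembering the earliest timestamp in that run.
    stall_start = latest.timestamp_s
    for s in reversed(samples):
        if s.block_height != latest.block_height:
            break
        stall_start = s.timestamp_s
    return latest.timestamp_s - stall_start >= max_stall_s


def queue_growing(samples: List[Sample], min_consecutive: int) -> bool:
    """Return True if queue depth rose in each of the last min_consecutive intervals."""
    if min_consecutive < 1 or len(samples) < min_consecutive + 1:
        return False
    tail = samples[-(min_consecutive + 1):]
    return all(b.queue_depth > a.queue_depth for a, b in zip(tail, tail[1:]))
```

Requiring several consecutive increases, rather than a single high reading, is one way to separate sustained growth from a short burst. Suitable window sizes depend on the scrape interval and block time of the deployment.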
Runbooks
- Write runbooks for peer restart, Torii degradation, key compromise, permission mistakes, fee sponsor depletion, stuck queues, and network partition symptoms.
- Include exact read-only checks before write operations, especially for peer registration, permission grants, and parameter changes.
- Keep emergency contacts and escalation rules outside the docs repo if they include private operational data.
- Review runbooks after every incident, rehearsal, or major upgrade.
See Operational Security.
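
The "read-only checks before write operations" rule can be enforced in tooling as well as in prose. The sketch below is one possible shape for such a gate, assuming each check is a callable that only reads state and returns a boolean. The function name `run_with_preflight` and the check names in the usage note are hypothetical and are not part of any Iroha client API.

```python
from typing import Callable, List, Optional, Tuple

# A named read-only check; the callable must not modify any state.
Check = Tuple[str, Callable[[], bool]]


def run_with_preflight(
    checks: List[Check], write: Callable[[], str]
) -> Tuple[List[str], Optional[str]]:
    """Run every read-only check, then perform the write only if all pass.

    Returns (names of failed checks, result of the write or None if skipped).
    """
    # Evaluate all checks so the operator sees every failure, not just the first.
    failed = [name for name, check in checks if not check()]
    if failed:
        return failed, None
    return [], write()
```

A runbook step for a permission grant might pass checks such as "target account exists" and "permission not already granted" (both hypothetical here) and a `write` callable that submits the transaction. If any check fails, nothing is submitted and the failed check names go into the incident notes.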
Backups and Recovery
- Back up peer storage according to the recovery point required by the deployment. Validate restores on non-production hosts.
- Keep signed genesis, release metadata, peer config, and key custody records recoverable even if a validator host is unavailable.
- Document whether a recovery procedure rebuilds from genesis, restores from a snapshot, or replaces a failed peer with a new identity.
- Never test restore procedures for the first time during a production incident.
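
One basic part of validating a restore is confirming that the restored files match what was backed up. The sketch below compares SHA-256 digests of a restored directory against a manifest recorded at backup time. It assumes the backup was taken from a quiesced or snapshotted directory and that files fit in memory. The function names are illustrative, and a matching manifest does not by itself prove that a peer can start and sync from the restored data.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes loads the whole file; very large files would
            # need chunked hashing instead.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def restore_mismatches(expected: Dict[str, str], restored_root: Path) -> List[str]:
    """Return sorted relative paths that are missing, extra, or changed after a restore."""
    actual = build_manifest(restored_root)
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

An empty result from `restore_mismatches` is a precondition for the next rehearsal step, such as starting a peer on the non-production host and checking that it reaches the expected block height.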
Change Management
- Treat on-chain configuration changes as transactions that require review, preflight reads, authorization, and post-change verification.
- Roll out peer binary upgrades with a compatibility plan and a rollback decision point.
- Avoid changing peer topology, consensus timing, and application workload in the same maintenance window unless the migration plan requires it.
- Record the transaction hashes and block heights for operational changes.
See Hot Reload and Compatibility Matrix.
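
Recording transaction hashes and block heights is easier to audit when every change uses the same record shape. The following sketch shows one possible structure. The `ChangeRecord` class and all of its field names are assumptions for illustration, not a format defined by the project.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ChangeRecord:
    """Audit entry for one operational change; all field names are illustrative."""
    summary: str
    reviewer: str
    tx_hashes: List[str] = field(default_factory=list)
    block_heights: List[int] = field(default_factory=list)
    verified: bool = False  # set once post-change reads confirm the new state

    def is_complete(self) -> bool:
        """Complete when each transaction has a block height and verification passed."""
        return (
            bool(self.tx_hashes)
            and len(self.tx_hashes) == len(self.block_heights)
            and self.verified
        )

    def to_json(self) -> str:
        """Serialize with sorted keys so records diff cleanly in version control."""
        return json.dumps(asdict(self), sort_keys=True)
```

A maintenance window can then be closed only when every record for that window reports `is_complete()`.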
Capacity Reviews
- Re-run load checks when validator count, hardware, network placement, workload mix, or consensus parameters change.
- Measure warm-up, steady state, and expected peak load rather than relying on a short best-case throughput sample.
- Compare accepted throughput with committed throughput and queue depth. If submitted TPS stays above committed TPS while queues keep growing, the network is past its sustainable envelope.
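
The comparison in the last point can be reduced to a simple check over steady-state samples. This is a minimal sketch under stated assumptions: the input lists are hypothetical per-interval measurements taken after warm-up, and the rule (mean submitted TPS above mean committed TPS, with the queue ending deeper than it started) is a deliberately coarse reading of the criterion above.

```python
from typing import List


def mean(values: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def past_sustainable_envelope(
    submitted_tps: List[float],
    committed_tps: List[float],
    queue_depth: List[int],
) -> bool:
    """Return True if submitted TPS exceeds committed TPS and the queue grew.

    All inputs should cover the same steady-state window, warm-up excluded.
    """
    if not submitted_tps or not committed_tps or len(queue_depth) < 2:
        raise ValueError("need at least one TPS sample and two queue samples")
    backlog_building = mean(submitted_tps) > mean(committed_tps)
    queue_grew = queue_depth[-1] > queue_depth[0]
    return backlog_building and queue_grew
```

Running the same check separately on the steady-state and expected-peak windows shows whether the network only exceeds its envelope at peak or already does so under normal load.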