From alert to approved action — Nerve agent runbooks

The gap after alerting

Alertmanager, Zabbix, Datadog, Grafana, and CI systems are good at detecting trouble. The weak part is often the next action: someone opens a laptop, finds the right SSH key, runs a command from memory, and hopes it is the right host.

Nerve keeps your existing monitoring stack as the source of truth. It adds a phone-native path for receiving the alert and, where appropriate, approving a narrow runbook action through an agent you control.

Correct model: signal first, action second

Signal: A send-only DSN used by monitoring or CI. It cannot read history or execute commands.

Decision: A person sees the context and chooses whether a remediation action is appropriate.

Action: An agent on a trusted host accepts signed, bounded commands. It authenticates with a different credential from the sender DSN.
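To make "signed, bounded commands" concrete, here is a minimal Python sketch of how an agent could check a command before acting on it. It is not Nerve's actual wire format: the payload fields, the `AGENT_KEY` secret, the `MAX_AGE_SECONDS` window, and the choice of HMAC-SHA256 are all assumptions made for illustration. The point is only that the agent's key is separate from the sender DSN, and that a tampered or stale command is rejected.

```python
import hashlib
import hmac
import json

# Hypothetical secret known only to the agent and the approval service.
# It is deliberately unrelated to the send-only DSN used by monitoring.
AGENT_KEY = b"example-agent-key"

# Assumed freshness window: an approval older than this is ignored.
MAX_AGE_SECONDS = 120


def sign(payload: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(payload: dict, signature: str, key: bytes, now: float) -> bool:
    """Accept only a correctly signed command issued within the window."""
    expected = sign(payload, key)
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, signature):
        return False
    issued = payload.get("issued_at")
    if not isinstance(issued, (int, float)):
        return False
    return 0 <= now - issued <= MAX_AGE_SECONDS
```

A command signed with any other secret, including one derived from the sender DSN, fails the first check, so a leaked DSN alone cannot trigger an action in this model.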

Good actions are boring

The safest first actions are small, reversible, and already documented in a runbook:

restart a single systemd service;
run a health check or collect diagnostics;
clear a known safe cache directory;
trigger a read-only status script;
start a pre-approved rollback wrapper.

A mobile action should not be an open shell. It should be an approved operation with a name, a scope, and a known blast radius.
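One way to picture "a name, a scope, and a known blast radius" is a small allowlist that maps each approved action to a fixed command. The Python sketch below is illustrative only: the `Action` type, the registry contents, the host names, and the wrapper paths are hypothetical and not part of Nerve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str              # what the approver sees on the phone
    argv: tuple[str, ...]  # fixed command; no caller-supplied arguments
    hosts: frozenset[str]  # scope: the only hosts where it may run
    blast_radius: str      # plain-language worst case, shown before approval


# Hypothetical registry; every name, path, and host here is an example.
ACTIONS = {
    action.name: action
    for action in (
        Action(
            name="restart-web",
            argv=("/usr/local/libexec/nerve-actions/restart-web",),
            hosts=frozenset({"web-1", "web-2"}),
            blast_radius="brief 5xx burst on one web node",
        ),
        Action(
            name="collect-diagnostics",
            argv=("/usr/local/libexec/nerve-actions/collect-diagnostics",),
            hosts=frozenset({"web-1", "web-2", "db-1"}),
            blast_radius="read-only; writes one archive under /var/lib/nerve-agent",
        ),
    )
}


def resolve(name: str, host: str) -> tuple[str, ...]:
    """Return the fixed argv for an approved action, or refuse."""
    action = ACTIONS.get(name)
    if action is None:
        raise PermissionError(f"unknown action: {name!r}")
    if host not in action.hosts:
        raise PermissionError(f"{name!r} is not approved for host {host!r}")
    return action.argv
```

Because the request carries only an action name and a host, there is nothing for a caller to inject: anything outside the registry is refused rather than interpreted.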

Isolated agent host

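Run the agent as a dedicated Unix user with least privilege. Put remediation logic in wrapper scripts and grant only those wrappers through sudoers or file permissions. Note that NoNewPrivileges=true, as set in the unit below, stops setuid programs such as sudo from gaining privileges, so with that hardening the sudoers route is unavailable and the wrappers must work with the agent user's own permissions.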

[Service]
User=nerve-agent
Group=nerve-agent
NoNewPrivileges=true
PrivateTmp=true
ProtectHome=true
ProtectSystem=strict
ReadWritePaths=/var/lib/nerve-agent /run/nerve-actions
ExecStart=/usr/local/bin/nerve-agent -server api.nerve.ink:443 -token TOKEN
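A wrapper script for the "restart a single systemd service" case could look like the Python sketch below. The allowlisted unit names and the `/usr/bin/systemctl` path are assumptions for illustration, and how the agent user is authorized to restart a unit (for example a polkit rule) is left out. The wrapper accepts one argument, checks it against a fixed set, and never passes anything through a shell.

```python
import subprocess
import sys

# Hypothetical allowlist: the only units this wrapper will ever restart.
ALLOWED_UNITS = frozenset({"nginx.service", "myapp.service"})

# Assumed location of systemctl; adjust for the distribution in use.
SYSTEMCTL = "/usr/bin/systemctl"


def build_command(unit: str) -> list[str]:
    """Return the exact argv to run, or raise if the unit is not allowlisted."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit not allowlisted: {unit!r}")
    return [SYSTEMCTL, "restart", unit]


def main(argv: list[str]) -> int:
    """Validate the single argument, then run the fixed restart command."""
    if len(argv) != 2:
        print("usage: restart-unit UNIT", file=sys.stderr)
        return 2
    try:
        cmd = build_command(argv[1])
    except ValueError as exc:
        print(exc, file=sys.stderr)
        return 1
    # argv list with no shell: the unit name is never interpreted as syntax.
    return subprocess.run(cmd, timeout=60).returncode
```

To install it as a script, call `sys.exit(main(sys.argv))` from a `__main__` guard. Keeping `build_command` separate from `main` lets the allowlist logic be tested without touching systemd.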

Do not auto-fix by default

Automatic remediation is tempting, but it can hide symptoms or make incidents worse. Start with human-approved actions. Once a runbook has months of safe history, you can decide whether a narrow automatic action belongs in your monitoring system itself.
