{
  "playbook": "disk_cleanup_prod_db",
  "version": "1.4.2",
  "incident_id": "inc_2026_02_10_1435",
  "target_host": "prod-db-03.us-east-1",
  "timestamp": "2026-02-10T14:35:43.124Z",
  "signature": "ed25519:a8f3b2c1d9e4f5a6b7c8d9e0...",
  "steps": [...]
}

Automated Remediation
Execute. Verify. Rollback. Zero Manual Intervention.
Idempotent playbooks execute via SSH, kubectl, and cloud APIs. Health verification after every step. Automatic rollback on failure. RBAC prevents runaway automation. Routine infrastructure fixes typically execute in under 60 seconds.
From Playbook Selection to Verified Resolution
Five-stage pipeline: receive signed playbook, verify preconditions, execute steps sequentially, verify health, rollback on failure. Every step logged, every action auditable.
Stage 1: Playbook Reception
<50ms
Control plane sends signed playbook to agent. Agent verifies Ed25519 cryptographic signature, checks timestamp freshness (<5 min), confirms target host matches, and validates playbook version. If any check fails: reject and log failed authorization attempt.
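The four reception checks above can be sketched as a single gate function. This is an illustrative sketch, not the agent's actual code: `verify_signature` is a stand-in callback for a real Ed25519 verifier, and the timestamp is assumed to be ISO-8601 with an explicit offset.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # freshness window from Stage 1

def accept_playbook(playbook, agent_hostname, cached_versions,
                    verify_signature, now=None):
    """Return True only if all four reception checks pass (sketch)."""
    now = now or datetime.now(timezone.utc)
    # 1. Cryptographic signature (callback stands in for Ed25519 verify)
    if not verify_signature(playbook["signature"], playbook):
        return False  # reject: bad signature
    # 2. Timestamp freshness: must be less than 5 minutes old (replay defense)
    issued = datetime.fromisoformat(playbook["timestamp"])
    if now - issued > MAX_AGE:
        return False  # reject: stale playbook
    # 3. Target host must match this agent's hostname
    if playbook["target_host"] != agent_hostname:
        return False
    # 4. Playbook version must exist in the local cache
    return playbook["version"] in cached_versions
```

Any single failed check rejects the playbook; the caller would then log the failed authorization attempt.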
Stage 2: Pre-Execution Checks
<100ms
Verify resource availability (disk, CPU, memory headroom), dependency checks (required binaries, service accounts, secrets), and conflict detection (no other playbook running, no maintenance window active, host not marked do-not-remediate). If preconditions fail: abort, report, escalate.
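A minimal sketch of these precondition checks. The lock-file path, do-not-remediate flag path, and 1 GiB headroom threshold are illustrative assumptions, not product defaults.

```python
import os
import shutil

def check_preconditions(required_binaries,
                        lock_path="/var/run/sentienguard.lock",
                        min_free_bytes=1 << 30,
                        do_not_remediate_flag="/etc/do-not-remediate"):
    """Stage 2 sketch: return a list of failures; empty list means proceed."""
    failures = []
    # Resource availability: enough free disk for logs/temp files
    if shutil.disk_usage("/").free < min_free_bytes:
        failures.append("insufficient disk headroom")
    # Dependency checks: required binaries present on PATH
    for binary in required_binaries:
        if shutil.which(binary) is None:
            failures.append(f"missing binary: {binary}")
    # Conflict detection: another playbook running, or host opted out
    if os.path.exists(lock_path):
        failures.append("another playbook is executing")
    if os.path.exists(do_not_remediate_flag):
        failures.append("host marked do-not-remediate")
    return failures
```

A non-empty return means abort, report, and escalate rather than execute.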
Stage 3: Step-by-Step Execution
10-90s
Execute steps sequentially (not parallel). Each step captures stdout, stderr, exit code. Step 2 depends on Step 1 completing. Health verification after each step. Timeout enforcement per step (default 60s). If step exceeds timeout: kill, mark failed, trigger rollback.
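Sequential execution with per-step timeouts and output capture might look like the following sketch (the step schema is illustrative, and a POSIX shell is assumed):

```python
import subprocess

def run_steps(steps, default_timeout=60):
    """Stage 3 sketch: run steps one at a time, capture stdout/stderr/exit
    code, enforce a per-step timeout, and stop at the first failure."""
    results = []
    for step in steps:
        try:
            proc = subprocess.run(
                ["sh", "-c", step["command"]],
                capture_output=True, text=True,
                timeout=step.get("timeout", default_timeout),
            )
            results.append({"name": step["name"],
                            "exit_code": proc.returncode,
                            "stdout": proc.stdout,
                            "stderr": proc.stderr})
            if proc.returncode != 0:
                break  # failed step: stop here, caller triggers rollback
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child when the timeout expires
            results.append({"name": step["name"],
                            "exit_code": None, "timed_out": True})
            break
    return results
```

A timed-out or failing step halts the sequence so later steps never run against an unknown state.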
Stage 4: Health Verification
10-30s
Verify desired state achieved: metric thresholds met, HTTP endpoints healthy, no new errors in logs, performance within bounds. Retry logic with exponential backoff (3 attempts). All checks must pass. If any check fails after retries: trigger rollback.
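The retry loop can be sketched as below; the delays and attempt count are parameters of this sketch, not the product's actual schedule.

```python
import time

def verify_with_backoff(check, attempts=3, base_delay=1.0, factor=2.0,
                        sleep=time.sleep):
    """Stage 4 sketch: retry a health check with exponential backoff.
    `check` is a callable returning True once the desired state is reached."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return True  # verification passed
        if attempt < attempts:
            sleep(delay)      # wait: services may still be starting up
            delay *= factor   # exponential backoff: 1s, 2s, 4s, ...
    return False  # retries exhausted: caller triggers rollback
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real waiting.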
Stage 5: Rollback on Failure
10-60s
If verification fails: execute rollback steps in reverse order, verify rollback successful, report failure to control plane, escalate to human. Operations that can't roll back (deleted files, external API calls) are skipped. Best practice: design for idempotency over rollback.
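Reverse-order rollback with skip support, as a hedged sketch: the `run` and `verify` callbacks and the step schema are assumptions for illustration.

```python
def execute_with_rollback(steps, run, verify):
    """Stage 5 sketch: run forward steps; on any failure, roll back the
    completed steps in reverse order, skipping steps marked rollback='skip'."""
    completed = []
    for step in steps:
        if not run(step["command"]):
            break  # forward step failed: fall through to rollback
        completed.append(step)
    else:
        if verify():
            return "SUCCESS"
    # Failure path: undo in reverse; irreversible steps are skipped
    for step in reversed(completed):
        if step.get("rollback", "skip") != "skip":
            run(step["rollback"])
    return "FAILED_ROLLED_BACK"  # report to control plane, escalate to human
```

Note the last completed step is undone first, mirroring the "reverse order" rule above.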
1. Verify cryptographic signature (Ed25519 public key)
2. Check timestamp freshness (must be <5 minutes old)
3. Verify target host matches agent's hostname
4. Confirm playbook version exists in cache
If ANY check fails: Reject, log failed authorization attempt
If all checks pass: Proceed to execution

Before executing ANY step, verify preconditions:
1. Resource availability:
   - Disk space sufficient for logs/temp files
   - CPU headroom available (not already at 100%)
   - Memory available for execution
2. Dependency checks:
   - Required binaries present (kubectl, aws, gcloud)
   - Service accounts configured
   - Secrets accessible (SSH keys, API tokens)
3. Conflict detection:
   - No other playbook currently executing (serialized)
   - No maintenance windows active
   - Host not marked as "do not remediate"
If preconditions fail: Abort, report, escalate to human
If preconditions pass: Begin execution

# Each step has a timeout
steps:
  - name: clear_temp_files
    timeout: 60s  # Max 60 seconds
  - name: rotate_logs
    timeout: 60s
# If a step exceeds its timeout: KILL, mark failed, trigger rollback

Execute playbook steps sequentially (not parallel):
Step 1: Clear temp files
Command: find /tmp -type f -mtime +7 -delete
Execution: SSH to localhost, run command
Capture: stdout, stderr, exit code
Duration: 3.8 seconds
Result: 100 files deleted, 8.3 GB freed
Step 2: Rotate logs
Command: logrotate -f /etc/logrotate.conf
Execution: SSH to localhost, run command
Capture: stdout, stderr, exit code
Duration: 1.9 seconds
Result: 12 log files rotated, 3.1 GB freed
Step 3: Verify space freed
Command: df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
Capture: stdout = "72" (72% disk usage)
Duration: 0.2 seconds
Result: Disk usage 72% (down from 91%)

After execution completes, verify desired state:
Verification checks:
1. Metric check: disk_usage < 80%
2. Service health: HTTP 200 on health endpoint
3. Error log check: No new errors in /var/log/syslog
4. Performance check: Disk I/O latency <20ms
Health verification result:
- Disk usage: 72.1% ✓ (< 80% threshold)
- HTTP health: 200 OK ✓
- Error logs: 0 new errors ✓
- Disk latency: 3ms ✓ (<20ms threshold)
Overall: PASS (all checks passed)

Verification failed: Disk usage 89% (expected <80%)
Rollback triggered:
1. Identify rollback steps from playbook
2. Execute rollback steps in reverse order
3. Verify rollback successful
4. Report failure to control plane
5. Escalate to human
Problem: Disk cleanup is one-way (can't restore deleted files)
Action: Escalate to human, incident marked FAILED

Timing breakdown:
Reception: 50ms
Precondition checks: 100ms
Execution: 87s (example: disk cleanup)
Health verification: 20s (3 retries)
Rollback: 0s (verification passed, no rollback needed)
Total: 107 seconds (this example)

Running a Playbook Twice = Same Result as Running Once
Idempotent operations produce the same result regardless of how many times executed. Network failures, timeouts, retries must not cause duplicate actions or side effects. Every SentienGuard playbook enforces idempotency through state checks, conditional execution, and health verification.
Idempotent function f(x):
f(x) = result
f(f(x)) = result (applying twice = same as applying once)
f(f(f(x))) = result (applying N times = same result)
Non-idempotent function g(x):
g(x) = result_1
g(g(x)) = result_2 (different result!)
g(g(g(x))) = result_3 (keeps changing)

First Execution:
Files found: 1,247 files (8.3 GB total)
Files deleted: 1,247 files
Space freed: 8.3 GB
Disk usage: 91% → 72%
Second Execution (Immediate Retry):
Files found: 0 files (already deleted)
Files deleted: 0 files
Space freed: 0 GB
Disk usage: 72% → 72% (no change)
Third Execution:
Files found: 0 files
Disk usage: 72% → 72% (still no change)
Result: Idempotent. Running 1 time or 100 times = same outcome.

Why Idempotency Matters
Agent executes Step 2 of 4: Clear temp files
Network connection lost (timeout)
Control plane doesn't receive completion ACK
Question: Did Step 2 complete?
- Maybe yes (completed, but ACK lost in network)
- Maybe no (failed mid-execution)
Safe action: Re-run entire playbook from Step 1
If idempotent:
Step 1 (re-run): No-op (already done)
Step 2 (re-run): No-op (already done)
Step 3: Executes normally
Step 4: Executes normally
Result: Success, no duplicate actions
If NOT idempotent:
Step 1 (re-run): Duplicate action (BAD)
Step 2 (re-run): Duplicate action (BAD)
Result: Side effects, unintended consequences

Engineer reviews incident dashboard:
"Did disk_cleanup_prod_db run? I don't see confirmation."
Engineer clicks: "Re-run playbook manually"
If idempotent:
Playbook checks disk usage: 72% (already fixed)
Conditional: disk_usage > 85% = FALSE
No action taken
Result: Safe, no harm done
If NOT idempotent:
Playbook runs again, deletes more files
Result: Accidental damage from double-execution

How We Enforce Idempotency
Mechanism 1: State Checks Before Execution
# Playbook includes conditional checks
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  only_if: "disk_usage > 85%"  # Skip if already below threshold

Mechanism 3: Health Verification as Exit Condition
def execute_playbook_idempotent(playbook):
    # Check if already in desired state
    if verify_health(playbook.verification):
        log("Already in desired state, skipping execution")
        return "SUCCESS_NO_ACTION"
    # Not in desired state, execute playbook
    execute_steps(playbook.steps)
    # Verify we reached desired state
    if verify_health(playbook.verification):
        return "SUCCESS"
    else:
        return "FAILED"

# First run:  disk 91% → execute → disk 72% → SUCCESS
# Second run: disk 72% → already healthy → SUCCESS_NO_ACTION

Mechanism 2: Evaluation Logic
def execute_step(step):
    # Check condition before running
    if step.only_if:
        condition_met = evaluate_condition(step.only_if)
        if not condition_met:
            log("Skipping step: condition not met")
            return "SKIPPED"
    # Condition met (or no condition), execute
    result = run_command(step.command)
    return result

# First run:
#   Condition: disk_usage > 85%, Current: 91%
#   91 > 85 = TRUE → Execute cleanup
# Second run (retry):
#   Condition: disk_usage > 85%, Current: 72%
#   72 > 85 = FALSE → Skip (already done)

Mechanism 4: Audit Trail Marks Retries
{
  "incident_id": "inc_2026_02_10_1435",
  "playbook": "disk_cleanup_prod_db",
  "executions": [
    {
      "execution_id": "exec_001",
      "timestamp": "2026-02-10T14:35:43Z",
      "result": "success",
      "duration": 87
    },
    {
      "execution_id": "exec_002",
      "timestamp": "2026-02-10T14:38:15Z",
      "result": "success_no_action",
      "duration": 2,
      "reason": "retry_after_timeout"
    }
  ]
}

Non-Idempotent Anti-Patterns

Incrementing Counters
echo $(( $(cat /var/retry_count) + 1 )) > /var/retry_count
counter = 1, then 2, then 3. Side effects accumulate.
Appending to Files
echo 'Incident resolved' >> /var/log/incidents.log
Duplicate log entries on each run.
Sending Notifications
mail -s 'Disk cleaned' ops@company.com
Duplicate emails on each retry.
DB Inserts Without Check
psql -c "INSERT INTO incidents (id, status) VALUES (123, 'resolved')"
Duplicate key error or duplicate rows.

Idempotent Patterns
Delete Files Matching Criteria
find /tmp -type f -mtime +7 -delete
Run 1: deletes 100 files. Run 2: deletes 0 (already gone). Same outcome.
Set State (Not Increment)
setquota -u postgres 100G 110G 0 0 /
Run 1: sets quota. Run 2: quota already set. No change.
Conditional Restarts
systemctl restart nginx  # only_if: service_status != active
Run 1: restarts (was failed). Run 2: skips (already active).
Upsert (Update or Insert)
INSERT INTO incidents ... ON CONFLICT (id) DO UPDATE SET status = 'resolved'
Run 1: inserts row. Run 2: updates same row. Single row, correct state.
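The upsert pattern can be demonstrated end-to-end with SQLite (illustrative schema; `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer):

```python
import sqlite3

# Demonstrate that an upsert is idempotent: running it twice leaves one row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, status TEXT)")

def resolve_incident(incident_id):
    # ON CONFLICT turns a duplicate insert into an update of the same row
    conn.execute(
        "INSERT INTO incidents (id, status) VALUES (?, 'resolved') "
        "ON CONFLICT (id) DO UPDATE SET status = 'resolved'",
        (incident_id,),
    )

resolve_incident(123)
resolve_incident(123)  # retry: updates the same row instead of erroring
rows = conn.execute("SELECT id, status FROM incidents").fetchall()
print(rows)  # [(123, 'resolved')]
```

A plain INSERT would have raised a duplicate-key error on the retry; the upsert converges to the same single row.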
Three Roles Prevent Runaway Automation
Autonomous systems without access controls are dangerous: junior engineers accidentally triggering production playbooks, compromised accounts executing malicious actions, bugs causing cascading failures. RBAC enforces who can see, who can approve, and who can configure.
Observer
Read-Only Access
Allowed
- View incidents, playbooks, audit logs, dashboards
- Search historical incidents
- Export reports (audit logs, compliance evidence)
- View playbook execution results
Denied
- Cannot approve playbook execution
- Cannot trigger playbooks manually
- Cannot modify playbooks or system settings
- Cannot manage users or roles
Use cases: Junior engineers, auditors, executives, external consultants, interns, contractors.
Remediation Authority
Approve & Execute
Allowed
- All Observer permissions
- Approve playbook execution via Slack approval gates
- Manually trigger playbooks from dashboard
- Override autonomous decisions (stop execution mid-flight)
- Acknowledge and close incidents
Denied
- Cannot create, edit, or delete playbooks
- Cannot manage users or assign roles
- Cannot modify system settings (thresholds, integrations)
Use cases: Senior SREs on-call, DevOps engineers managing production, infrastructure team leads.
Administrator
Full Control
Allowed
- All Remediation Authority permissions
- Create, edit, delete playbooks
- Manage users (add, remove, assign roles)
- Configure integrations (Slack, AWS, GCP, Azure)
- Modify system settings (thresholds, environments, anomaly detection)
- Access billing and subscription management
Use cases: DevOps team leads, platform engineering managers, CTO, system administrators.
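One way to model the three roles above is a simple permission table checked server-side. The action names here are illustrative, not SentienGuard's actual permission strings.

```python
# Hypothetical permission table distilled from the three roles above.
ROLE_PERMISSIONS = {
    "observer": {"view_incidents", "export_reports"},
    "remediation_authority": {"view_incidents", "export_reports",
                              "approve_playbook", "trigger_playbook"},
    "administrator": {"view_incidents", "export_reports", "approve_playbook",
                      "trigger_playbook", "edit_playbook", "manage_users"},
}

def is_allowed(role, action):
    """Server-side check: unknown roles get no permissions (fail closed)."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Each role is a strict superset of the one below it, matching the "all Observer permissions" / "all Remediation Authority permissions" wording above.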
Environment-Specific RBAC
users:
  - email: alice.jones@company.com
    roles:
      production: observer       # Read-only in prod
      staging: administrator     # Full control in staging
      dev: administrator         # Full control in dev
  - email: bob.chen@company.com
    roles:
      production: remediation_authority  # Can approve prod fixes
      staging: remediation_authority
      dev: observer              # Not involved in dev
  - email: charlie.wang@company.com
    roles:
      production: observer       # External auditor
      staging: observer          # Read-only everywhere
      dev: observer

Slack Approval Gate
[SentienGuard] Approval Required
Incident: prod-db-03 connection pool exhausted (98%)
Playbook: postgres_connection_reset (confidence 0.94)
This will:
1. Kill idle connections older than 1 hour (estimated: 23)
2. Reset connection pool to default limits
3. Verify new connections successful
Estimated duration: 23 seconds
Risk: May interrupt long-running analytical queries
[Approve] [Deny] [View Details]

{
  "approved_by": "bob.chen@company.com",
  "role": "Remediation Authority",
  "timestamp": "2026-02-10T02:47:52.187Z",
  "ip_address": "203.0.113.42",
  "approval_latency_seconds": 8,
  "incident_id": "inc_2026_02_10_0247",
  "playbook": "postgres_connection_reset",
  "confidence": 0.94
}

{
  "timestamp": "2026-02-10T02:48:00Z",
  "user": "alice.jones@company.com",
  "role": "Observer",
  "action": "approve_playbook",
  "incident_id": "inc_2026_02_10_0247",
  "result": "denied_insufficient_permissions",
  "ip_address": "203.0.113.42"
}

def approve_playbook(user, incident_id):
    # Server-side enforcement (even if client-side bypassed)
    if user.role == "Observer":
        log_failed_authorization(user, incident_id, "approve_playbook")
        raise PermissionError("Insufficient permissions")
    # Permission check passed, proceed
    execute_playbook(incident_id)

Audit log includes:
- Who detected: SentienGuard anomaly engine
- What anomaly: postgres_active_connections = 98%
- What playbook: postgres_connection_reset v1.3.0
- Who approved: bob.chen@company.com (Remediation Authority)
- When approved: 2026-02-10T02:47:52Z
- What executed: Complete command log with outputs
- Outcome: Success, connection pool restored to 64%
- MTTR: 31 seconds (detection to resolution)

Real Approval Workflow Result
Detection: 2:47:44 AM
Approval (Bob): 2:47:52 AM (8s)
Execution: 23 seconds
Total MTTR: 31 seconds
Every Step Verified Before Proceeding
Execution without verification is hope, not confidence. Five verification types confirm desired state. Retry logic with exponential backoff handles slow startups. All checks must pass or rollback triggers.
Without verification:
Step 1: Clear temp files
- Command executed: find /tmp -mtime +7 -delete
- Exit code: 0 (success)
- Assumption: Disk space freed
Reality: /tmp was empty, no space freed
Result: Disk still 91%, incident not resolved
Problem: Didn't verify space actually freed

With verification:
Step 1: Clear temp files
- Command executed: find /tmp -mtime +7 -delete
- Exit code: 0 (success)
- Verification: Check disk usage
* Expected: <80%
* Actual: 91%
* Result: FAILED
Action: Move to Step 2 (try different cleanup method)
OR: Rollback and escalate (if no more steps)

Five Verification Types
Type 1: Metric Verification
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"
    retry: 3
    retry_delay: 10s

Type 2: HTTP Endpoint
verification:
  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200
    timeout: 5s
    retry: 3

Type 3: Process Verification
verification:
  - type: process
    name: postgresql
    state: running
    min_uptime: 10s  # Must be running for at least 10 seconds

Type 4: Custom Command
verification:
  - type: command
    command: "pg_isready -U postgres"
    expected_exit_code: 0
    timeout: 5s

Type 5: Log File Check
verification:
  - type: log_check
    file: /var/log/application.log
    pattern: "ERROR|FATAL"
    max_matches: 0  # No errors expected
    lookback: 60s   # Check last 60 seconds of logs

Retry with Exponential Backoff
T+0s: Execute playbook (restart service)
T+3s: Attempt 1: Connection refused (service starting)
T+8s: Attempt 2: 503 Service Unavailable (still starting)
T+18s: Attempt 3: 503 Service Unavailable
T+33s: Attempt 4: 200 OK ← SUCCESS
Skip attempt 5, verification passed
Total verification time: 33 seconds

Composite Verification (Multiple Checks)
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"
  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200
  - type: log_check
    file: /var/log/app.log
    pattern: "ERROR"
    max_matches: 0
# ALL checks must pass for verification to succeed

Composite Execution Result
Check 1: Disk usage
Current: 72.1%
Threshold: <80%
Result: PASS ✓
Check 2: HTTP health endpoint
Response: 200 OK
Result: PASS ✓
Check 3: Error logs
Matches: 0 errors
Max: 0
Result: PASS ✓
Overall: PASS (all 3 checks passed)

Automatic Revert When Health Checks Fail
If verification fails, undo changes and return to previous state. Better to abort than leave system broken. Rollback triggers on failed health checks, step execution failures, precondition violations, and manual abort.
name: nginx_restart
steps:
  - name: stop_nginx
    command: systemctl stop nginx
    rollback: systemctl start nginx
  - name: clear_cache
    command: rm -rf /var/cache/nginx/*
    rollback: skip  # Can't restore cache
  - name: start_nginx
    command: systemctl start nginx
    rollback: systemctl stop nginx
verification:
  - type: http
    url: http://localhost:80/health
    expected_status: 200
    retry: 3

Successful execution:
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS
Verification: HTTP 200 → PASS
No rollback needed

Failed execution:
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS (exit code 0, but service crashes)
Verification: HTTP GET localhost:80/health
Attempt 1: Connection refused → FAIL
Attempt 2: Connection refused → FAIL
Attempt 3: Connection refused → FAIL
Verification FAILED after 3 retries
Rollback triggered:
Step 3 rollback: systemctl stop nginx
Step 2 rollback: skip (can't restore cache)
Step 1 rollback: systemctl start nginx
Result: nginx restarted (back to original state)
Status: Incident marked FAILED, escalated to human

Rollback Limitations
Deleted Files
Permanent loss, can't restore without backup.
Solution: Design playbooks to be idempotent (safe to retry).
External API Calls
Already executed, can't un-launch instances.
Solution: Check if resources exist before creating.
Database Writes
Unless transaction-wrapped, can't undo.
Solution: Use database transactions or idempotent upserts.
Sent Notifications
Can't unsend messages.
Solution: Send notifications AFTER verification passes.
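The transaction-wrapped write pattern from the "Database Writes" point above can be sketched with SQLite; the table and health check are illustrative, not the product's schema.

```python
import sqlite3

# Sketch: wrap a database write in a transaction so a failed verification
# rolls it back instead of leaving partial state behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()

def apply_change(verified):
    try:
        with conn:  # opens a transaction; rolls back if the block raises
            conn.execute("INSERT INTO config VALUES ('max_conns', '200')")
            if not verified():
                raise RuntimeError("health check failed")
    except RuntimeError:
        pass  # write was rolled back; escalate instead of partial state

apply_change(lambda: False)
print(conn.execute("SELECT COUNT(*) FROM config").fetchone()[0])  # 0: rolled back
```

The `with conn:` context manager commits on success and rolls back on an exception, so the failed verification leaves zero rows.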
# Some steps can rollback, others can't
steps:
  - name: stop_service
    rollback: start_service       # Can rollback
  - name: delete_cache
    rollback: skip                # Can't rollback (files deleted)
  - name: update_config
    rollback: restore_old_config  # Can rollback (config backed up)
# If verification fails:
#   Rollback what's possible (restart service, restore config)
#   Escalate with partial rollback report

steps:
  - name: database_migration
    command: psql -f /migrations/v2.sql
    timeout: 300s          # 5 minutes
    rollback: psql -f /migrations/v2_rollback.sql
    rollback_timeout: 60s  # 1 minute (rollback should be faster)
# Why: Rollback is emergency recovery.
# If it takes too long, system stays broken.

Best Practices
Prefer Idempotency
Design playbooks safe to retry rather than relying on rollback.
Test Rollback Paths
Force verification failure in staging to confirm rollback works.
Partial Rollback
Rollback what's possible, escalate with partial report.
Rollback Timeout
Rollback should complete faster than forward execution.
SSH, Kubernetes, Cloud APIs
Three execution methods cover all infrastructure: SSH for Linux server operations, kubectl for Kubernetes, and cloud provider CLIs for AWS, GCP, and Azure resources.
SSH Commands
Linux server operations: disk cleanup, service restarts, file operations. Agent executes locally via SSH to localhost. SSH keys stored in Secrets Manager, retrieved just-in-time. Sudo allowed for specific commands only via /etc/sudoers.d/.
Kubernetes (kubectl)
Kubernetes operations: pod restarts, scaling, rollbacks. Uses ServiceAccount credentials with least-privilege RBAC. Namespace-scoped: production ServiceAccount can't touch staging.
Cloud Provider APIs
AWS CLI, gcloud, az CLI for cloud infrastructure. IAM roles (AWS), Service Principals (Azure), Service Accounts (GCP). No static API keys stored. Actions limited to playbook requirements.
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  user: sentienguard  # SSH as this user
  timeout: 60s
# Agent executes locally (SSH to localhost)
# Captures: exit code, stdout, stderr, duration
# Security: SSH keys from Secrets Manager (just-in-time)
# Sudo: allowed for specific commands only

- name: restart_failed_pod
  action: kubectl_delete
  resource: pod
  selector: "status=CrashLoopBackOff"
  namespace: production
  timeout: 30s
# Agent uses Kubernetes API via ServiceAccount
# What happens:
# 1. Kubernetes deletes matching pods
# 2. ReplicaSet/Deployment creates new pods
# 3. New pods start fresh (no inherited crash state)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sentienguard-agent
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]  # Pod restarts
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "patch"]           # Scaling, rollbacks

- name: snapshot_ebs_volume
  action: aws_cli
  command: |
    aws ec2 create-snapshot \
      --volume-id vol-1234567890abcdef \
      --description "Automated snapshot before maintenance"
  timeout: 300s

- name: scale_instance_group
  action: gcloud
  command: |
    gcloud compute instance-groups managed resize my-group \
      --size 10 \
      --zone us-central1-a
  timeout: 120s

- name: restart_vm
  action: az_cli
  command: |
    az vm restart \
      --resource-group my-rg \
      --name prod-vm-01
  timeout: 180s

| Method | Use Case | Auth | Speed | Complexity |
|---|---|---|---|---|
| SSH | Linux server ops | SSH keys | Fast (<5s) | Low |
| kubectl | Kubernetes ops | ServiceAccount | Fast (<10s) | Medium |
| AWS CLI | AWS resources | IAM role | Medium (10-60s) | Medium |
| gcloud | GCP resources | Service Account | Medium (10-60s) | Medium |
| az | Azure resources | Service Principal | Medium (10-60s) | Medium |
Common Questions
What happens when a step fails mid-playbook?
Playbooks execute steps sequentially. If Step 3 fails, Steps 1-2 already completed. Rollback executes in reverse: undo Step 2, undo Step 1. If rollback succeeds, the system returns to its pre-playbook state. If rollback fails, escalate to a human with a partial rollback report.
Can I test playbooks before enabling autonomous execution?
Yes. Three testing methods: (1) Dry-run mode (simulate execution; no actual commands run), (2) Staging environment (run on staging hosts first), (3) Manual trigger (an engineer triggers the playbook manually and reviews the result before enabling autonomous execution).
How do I prevent a playbook from running autonomously?
Set an approval requirement: approval_gate: required: true. The playbook will then always require human approval via Slack and never runs autonomously. Or disable the playbook entirely for specific hosts: exclusions: host_pattern: "*.prod.*".
How do I see exactly what a playbook will do?
Dashboard → Playbooks → disk_cleanup_prod_db → View YAML. The complete playbook definition is visible, including all commands, verification checks, and rollback steps. Audit logs show the exact commands executed historically.
Can playbooks modify themselves?
No. Playbooks are immutable (version-controlled). Only Administrators can create or edit playbooks via the dashboard. Playbooks cannot self-modify or spawn new playbooks, which prevents runaway automation.
Can the same playbook loop on a single incident?
No. Playbooks execute only once per incident. After execution, the incident is marked "resolved" or "failed." The same incident never triggers the same playbook twice. If the problem recurs, a new incident is created (with a different incident ID).
See Automated Remediation in Action
Deploy agents, import disk_cleanup playbook, fill disk to 90%, watch autonomous execution, review health verification, examine audit log.
name: disk_cleanup_test
steps:
  - name: create_large_file
    command: "dd if=/dev/zero of=/tmp/testfile bs=1M count=1000"
  - name: verify_disk_high
    command: "df -h / | awk 'NR==2 {print $5}'"
  - name: delete_large_file
    command: "rm /tmp/testfile"
  - name: verify_disk_normal
    command: "df -h / | awk 'NR==2 {print $5}'"
Free tier: 3 nodes, 50+ pre-built playbooks, full RBAC, complete audit logs, no credit card.