SentienGuard

Automated Remediation

Execute. Verify. Rollback. Zero Manual Intervention.

Idempotent playbooks execute via SSH, kubectl, and cloud APIs. Health verification after every step. Automatic rollback on failure. RBAC prevents runaway automation. <60 second typical execution for routine infrastructure fixes.

<60s — Typical execution time (routine infrastructure fixes)
100% — Idempotency guarantee (safe to retry, no side effects)
3 roles — RBAC enforcement (Observer, Remediation Authority, Admin)
Auto — Rollback on failure (health check fails → revert changes)

From Playbook Selection to Verified Resolution

Five-stage pipeline: receive signed playbook, verify preconditions, execute steps sequentially, verify health, rollback on failure. Every step logged, every action auditable.

1

Stage 1: Playbook Reception

<50ms

Control plane sends signed playbook to agent. Agent verifies Ed25519 cryptographic signature, checks timestamp freshness (<5 min), confirms target host matches, and validates playbook version. If any check fails: reject and log failed authorization attempt.
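The Stage 1 checks can be sketched in a few lines of Python. This is an illustration only, not the agent's actual code: the standard-library `hmac` module stands in for the Ed25519 signature (the real agent uses asymmetric keys), and the field names follow the signed payload example shown later on this page.

```python
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # timestamp freshness window
KEY = b"shared-secret"          # stand-in for the control plane's key material

def sign(payload: dict, key: bytes) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify_playbook(payload: dict, signature: str, key: bytes, hostname: str) -> str:
    # 1. Verify signature over the canonical payload bytes
    body = json.dumps(payload, sort_keys=True).encode()
    if not hmac.compare_digest(signature, hmac.new(key, body, hashlib.sha256).hexdigest()):
        return "reject: bad signature"
    # 2. Check timestamp freshness (<5 minutes old)
    ts = datetime.fromisoformat(payload["timestamp"])
    if datetime.now(timezone.utc) - ts > MAX_AGE:
        return "reject: stale timestamp"
    # 3. Target host must match this agent's hostname
    if payload["target_host"] != hostname:
        return "reject: wrong host"
    return "accept"

payload = {
    "playbook": "disk_cleanup_prod_db",
    "target_host": "prod-db-03.us-east-1",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
sig = sign(payload, KEY)
print(verify_playbook(payload, sig, KEY, "prod-db-03.us-east-1"))  # accept
print(verify_playbook(payload, sig, KEY, "other-host"))            # reject: wrong host
```

Any single failed check rejects the playbook before anything executes, which is why the checks run in this order: cheapest rejection first.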

2

Stage 2: Pre-Execution Checks

<100ms

Verify resource availability (disk, CPU, memory headroom), dependency checks (required binaries, service accounts, secrets), and conflict detection (no other playbook running, no maintenance window active, host not marked do-not-remediate). If preconditions fail: abort, report, escalate.
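The three precondition categories can be sketched as one gate function. This is a minimal sketch with hypothetical names (`preconditions_ok`, `min_free_gb`, `lock_held`), not the agent's real implementation; a real agent would also check CPU and memory headroom and maintenance windows.

```python
import shutil

def preconditions_ok(path="/", min_free_gb=1.0,
                     required_binaries=("find",), lock_held=False):
    """Stage 2 sketch: resource, dependency, and conflict checks."""
    # 1. Resource availability: enough free disk for logs/temp files
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        return False, f"only {free_gb:.1f} GB free"
    # 2. Dependency check: required binaries present on PATH
    for binary in required_binaries:
        if shutil.which(binary) is None:
            return False, f"missing binary: {binary}"
    # 3. Conflict detection: another playbook already running (serialized)
    if lock_held:
        return False, "another playbook is executing"
    return True, "ok"

print(preconditions_ok(required_binaries=("kubectl", "aws")))
```

If the gate returns False, the pipeline aborts before Stage 3 and escalates with the reason string.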

3

Stage 3: Step-by-Step Execution

10-90s

Execute steps sequentially (not parallel). Each step captures stdout, stderr, exit code. Step 2 depends on Step 1 completing. Health verification after each step. Timeout enforcement per step (default 60s). If step exceeds timeout: kill, mark failed, trigger rollback.
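Per-step capture and timeout enforcement map directly onto `subprocess.run`. A minimal sketch (the real agent executes via SSH/kubectl/cloud CLIs, not a local shell):

```python
import subprocess

def run_step(command: str, timeout_s: int = 60) -> dict:
    """Run one playbook step, capturing stdout, stderr, and exit code."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout_s)
        return {"exit_code": result.returncode,
                "stdout": result.stdout,
                "stderr": result.stderr}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the step is
        # already dead here; the pipeline would now trigger rollback
        return {"exit_code": None, "failed": "timeout", "rollback": True}

print(run_step("echo 8.3 GB freed", timeout_s=5))
```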

4

Stage 4: Health Verification

10-30s

Verify desired state achieved: metric thresholds met, HTTP endpoints healthy, no new errors in logs, performance within bounds. Retry logic with exponential backoff (3 attempts). All checks must pass. If any check fails after retries: trigger rollback.

5

Stage 5: Rollback on Failure

10-60s

If verification fails: execute rollback steps in reverse order, verify rollback successful, report failure to control plane, escalate to human. Operations that can't roll back (deleted files, external API calls) are skipped. Best practice: design for idempotency over rollback.

Stage 1: Signed Playbook Payload
{
  "playbook": "disk_cleanup_prod_db",
  "version": "1.4.2",
  "incident_id": "inc_2026_02_10_1435",
  "target_host": "prod-db-03.us-east-1",
  "timestamp": "2026-02-10T14:35:43.124Z",
  "signature": "ed25519:a8f3b2c1d9e4f5a6b7c8d9e0...",
  "steps": [...]
}
Stage 1: Agent Verification
1. Verify cryptographic signature (Ed25519 public key)
2. Check timestamp freshness (must be <5 minutes old)
3. Verify target host matches agent's hostname
4. Confirm playbook version exists in cache

If ANY check fails: Reject, log failed authorization attempt
If all checks pass: Proceed to execution
Stage 2: Pre-Execution Checks
Before executing ANY step, verify preconditions:

1. Resource availability:
   - Disk space sufficient for logs/temp files
   - CPU headroom available (not already at 100%)
   - Memory available for execution

2. Dependency checks:
   - Required binaries present (kubectl, aws, gcloud)
   - Service accounts configured
   - Secrets accessible (SSH keys, API tokens)

3. Conflict detection:
   - No other playbook currently executing (serialized)
   - No maintenance windows active
   - Host not marked as "do not remediate"

If preconditions fail: Abort, report, escalate to human
If preconditions pass: Begin execution
Stage 3: Timeout Enforcement
# Each step has timeout
steps:
  - name: clear_temp_files
    timeout: 60s  # Max 60 seconds

  - name: rotate_logs
    timeout: 60s

# If step exceeds timeout: KILL, mark failed, trigger rollback
Stage 3: Step-by-Step Execution
Execute playbook steps sequentially (not parallel):

Step 1: Clear temp files
  Command: find /tmp -type f -mtime +7 -delete
  Execution: SSH to localhost, run command
  Capture: stdout, stderr, exit code
  Duration: 3.8 seconds
  Result: 1,247 files deleted, 8.3 GB freed

Step 2: Rotate logs
  Command: logrotate -f /etc/logrotate.conf
  Execution: SSH to localhost, run command
  Capture: stdout, stderr, exit code
  Duration: 1.9 seconds
  Result: 12 log files rotated, 3.1 GB freed

Step 3: Verify space freed
  Command: df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
  Capture: stdout = "72" (72% disk usage)
  Duration: 0.2 seconds
  Result: Disk usage 72% (down from 91%)
Stage 4: Health Verification
After execution completes, verify desired state:

Verification checks:
  1. Metric check: disk_usage < 80%
  2. Service health: HTTP 200 on health endpoint
  3. Error log check: No new errors in /var/log/syslog
  4. Performance check: Disk I/O latency <20ms

Health verification result:
  - Disk usage: 72.1% ✓ (< 80% threshold)
  - HTTP health: 200 OK ✓
  - Error logs: 0 new errors ✓
  - Disk latency: 3ms ✓ (<20ms threshold)

Overall: PASS (all checks passed)
Stage 5: Rollback on Failure
Verification failed: Disk usage 89% (expected <80%)

Rollback triggered:
  1. Identify rollback steps from playbook
  2. Execute rollback steps in reverse order
  3. Verify rollback successful
  4. Report failure to control plane
  5. Escalate to human

Problem: Disk cleanup is one-way (can't restore deleted files)
Action: Escalate to human, incident marked FAILED
Total Pipeline Latency
Reception:           50ms
Precondition checks: 100ms
Execution:           87s (example: disk cleanup)
Health verification: 20s (3 retries)
Rollback:            0s (verification passed, no rollback needed)

Total: 107 seconds (typical)

Running a Playbook Twice = Same Result as Running Once

Idempotent operations produce the same result no matter how many times they are executed. Network failures, timeouts, and retries must not cause duplicate actions or side effects. Every SentienGuard playbook enforces idempotency through state checks, conditional execution, and health verification.

Idempotency Definition
Idempotent function f(x):
  f(x) = result
  f(f(x)) = result  (applying twice = same as applying once)
  f(f(f(x))) = result  (applying N times = same result)

Non-idempotent function g(x):
  g(x) = result_1
  g(g(x)) = result_2  (different result!)
  g(g(g(x))) = result_3  (keeps changing)
Real Example: Disk Cleanup
First Execution:
  Files found: 1,247 files (8.3 GB total)
  Files deleted: 1,247 files
  Space freed: 8.3 GB
  Disk usage: 91% → 72%

Second Execution (Immediate Retry):
  Files found: 0 files (already deleted)
  Files deleted: 0 files
  Space freed: 0 GB
  Disk usage: 72% → 72% (no change)

Third Execution:
  Files found: 0 files
  Disk usage: 72% → 72% (still no change)

Result: Idempotent. Running 1 time or 100 times = same outcome.

Why Idempotency Matters

Scenario: Network Timeout During Execution
Agent executes Step 2 of 4: Clear temp files
Network connection lost (timeout)
Control plane doesn't receive completion ACK

Question: Did Step 2 complete?
  - Maybe yes (completed, but ACK lost in network)
  - Maybe no (failed mid-execution)

Safe action: Re-run entire playbook from Step 1

If idempotent:
  Step 1 (re-run): No-op (already done)
  Step 2 (re-run): No-op (already done)
  Step 3: Executes normally
  Step 4: Executes normally
  Result: Success, no duplicate actions

If NOT idempotent:
  Step 1 (re-run): Duplicate action (BAD)
  Step 2 (re-run): Duplicate action (BAD)
  Result: Side effects, unintended consequences
Scenario: Manual Engineer Re-Trigger
Engineer reviews incident dashboard:
  "Did disk_cleanup_prod_db run? I don't see confirmation."

Engineer clicks: "Re-run playbook manually"

If idempotent:
  Playbook checks disk usage: 72% (already fixed)
  Conditional: disk_usage > 85% = FALSE
  No action taken
  Result: Safe, no harm done

If NOT idempotent:
  Playbook runs again, deletes more files
  Result: Accidental damage from double-execution

How We Enforce Idempotency

Mechanism 1: State Checks Before Execution

Conditional Execution
# Playbook includes conditional checks
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  only_if: "disk_usage > 85%"  # Skip if already below threshold

Mechanism 2: Evaluation Logic

State Check Logic
def execute_step(step):
    # Check condition before running
    if step.only_if:
        condition_met = evaluate_condition(step.only_if)
        if not condition_met:
            log("Skipping step: condition not met")
            return "SKIPPED"

    # Condition met (or no condition), execute
    result = run_command(step.command)
    return result

# First run:
#   Condition: disk_usage > 85%, Current: 91%
#   91 > 85 = TRUE → Execute cleanup
# Second run (retry):
#   Condition: disk_usage > 85%, Current: 72%
#   72 > 85 = FALSE → Skip (already done)

Mechanism 3: Health Verification as Exit Condition

Idempotent Execution Logic
def execute_playbook_idempotent(playbook):
    # Check if already in desired state
    if verify_health(playbook.verification):
        log("Already in desired state, skipping execution")
        return "SUCCESS_NO_ACTION"

    # Not in desired state, execute playbook
    execute_steps(playbook.steps)

    # Verify we reached desired state
    if verify_health(playbook.verification):
        return "SUCCESS"
    else:
        return "FAILED"

# First run:  disk 91% → execute → disk 72% → SUCCESS
# Second run: disk 72% → already healthy → SUCCESS_NO_ACTION

Mechanism 4: Audit Trail Marks Retries

Retry Tracking
{
  "incident_id": "inc_2026_02_10_1435",
  "playbook": "disk_cleanup_prod_db",
  "executions": [
    {
      "execution_id": "exec_001",
      "timestamp": "2026-02-10T14:35:43Z",
      "result": "success",
      "duration": 87
    },
    {
      "execution_id": "exec_002",
      "timestamp": "2026-02-10T14:38:15Z",
      "result": "success_no_action",
      "duration": 2,
      "reason": "retry_after_timeout"
    }
  ]
}
Anti-Patterns (NOT Idempotent — We Reject)

Incrementing Counters

echo $(( $(cat /var/retry_count) + 1 )) > /var/retry_count

counter = 1, then 2, then 3. Side effects accumulate.

Appending to Files

echo 'Incident resolved' >> /var/log/incidents.log

Duplicate log entries on each run.

Sending Notifications

mail -s 'Disk cleaned' ops@company.com

Duplicate emails on each retry.

DB Inserts Without Check

psql -c "INSERT INTO incidents (id, status) VALUES (123, 'resolved')"

Duplicate key error or duplicate rows.

Correct Patterns (Idempotent — We Use)

Delete Files Matching Criteria

find /tmp -type f -mtime +7 -delete

Run 1: deletes 100 files. Run 2: deletes 0 (already gone). Same outcome.

Set State (Not Increment)

setquota -u postgres 100G 110G 0 0 /

Run 1: sets quota. Run 2: quota already set. No change.

Conditional Restarts

systemctl restart nginx  # only_if: service_status != active

Run 1: restarts (was failed). Run 2: skips (already active).

Upsert (Update or Insert)

INSERT INTO incidents ... ON CONFLICT (id) DO UPDATE SET status = 'resolved'

Run 1: inserts row. Run 2: updates same row. Single row, correct state.
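The difference between the two columns is mechanical and can be demonstrated directly. A tiny sketch (the dict stands in for real system state; the function names are illustrative):

```python
def set_quota(state: dict, user: str, limit_gb: int) -> dict:
    """Idempotent: sets an absolute value; re-running changes nothing."""
    state = dict(state)
    state[user] = limit_gb
    return state

def bump_counter(state: dict, key: str) -> dict:
    """NOT idempotent: each run accumulates a new side effect."""
    state = dict(state)
    state[key] = state.get(key, 0) + 1
    return state

s = set_quota({}, "postgres", 100)
assert set_quota(s, "postgres", 100) == s   # f(f(x)) == f(x): safe to retry

c = bump_counter({}, "retries")
assert bump_counter(c, "retries") != c      # g(g(x)) != g(x): retries drift
```

Set-state operations converge to the desired value; incrementing operations drift further from it with every retry.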

Three Roles Prevent Runaway Automation

Autonomous systems without access controls are dangerous: junior engineers accidentally triggering production playbooks, compromised accounts executing malicious actions, bugs causing cascading failures. RBAC enforces who can see, who can approve, and who can configure.

Observer

Read-Only Access

Allowed

  • View incidents, playbooks, audit logs, dashboards
  • Search historical incidents
  • Export reports (audit logs, compliance evidence)
  • View playbook execution results

Denied

  • Cannot approve playbook execution
  • Cannot trigger playbooks manually
  • Cannot modify playbooks or system settings
  • Cannot manage users or roles

Use cases: Junior engineers, auditors, executives, external consultants, interns, contractors.

Remediation Authority

Approve & Execute

Allowed

  • All Observer permissions
  • Approve playbook execution via Slack approval gates
  • Manually trigger playbooks from dashboard
  • Override autonomous decisions (stop execution mid-flight)
  • Acknowledge and close incidents

Denied

  • Cannot create, edit, or delete playbooks
  • Cannot manage users or assign roles
  • Cannot modify system settings (thresholds, integrations)

Use cases: Senior SREs on-call, DevOps engineers managing production, infrastructure team leads.

Administrator

Full Control

Allowed

  • All Remediation Authority permissions
  • Create, edit, delete playbooks
  • Manage users (add, remove, assign roles)
  • Configure integrations (Slack, AWS, GCP, Azure)
  • Modify system settings (thresholds, environments, anomaly detection)
  • Access billing and subscription management

Use cases: DevOps team leads, platform engineering managers, CTO, system administrators.

Environment-Specific RBAC

Per-Environment Role Assignment
users:
  - email: alice.jones@company.com
    roles:
      production: observer           # Read-only in prod
      staging: administrator         # Full control in staging
      dev: administrator             # Full control in dev

  - email: bob.chen@company.com
    roles:
      production: remediation_authority  # Can approve prod fixes
      staging: remediation_authority
      dev: observer                  # Not involved in dev

  - email: charlie.wang@company.com
    roles:
      production: observer           # External auditor
      staging: observer              # Read-only everywhere
      dev: observer

Slack Approval Gate

Approval Request Message
[SentienGuard] Approval Required

Incident: prod-db-03 connection pool exhausted (98%)
Playbook: postgres_connection_reset (confidence 0.94)

This will:
  1. Kill idle connections older than 1 hour (estimated: 23)
  2. Reset connection pool to default limits
  3. Verify new connections successful

Estimated duration: 23 seconds
Risk: May interrupt long-running analytical queries

[Approve] [Deny] [View Details]
Successful Approval Log
{
  "approved_by": "bob.chen@company.com",
  "role": "Remediation Authority",
  "timestamp": "2026-02-10T02:47:52.187Z",
  "ip_address": "203.0.113.42",
  "approval_latency_seconds": 8,
  "incident_id": "inc_2026_02_10_0247",
  "playbook": "postgres_connection_reset",
  "confidence": 0.94
}
Failed Authorization Attempt
{
  "timestamp": "2026-02-10T02:48:00Z",
  "user": "alice.jones@company.com",
  "role": "Observer",
  "action": "approve_playbook",
  "incident_id": "inc_2026_02_10_0247",
  "result": "denied_insufficient_permissions",
  "ip_address": "203.0.113.42"
}
Server-Side Enforcement
def approve_playbook(user, incident_id):
    # Server-side enforcement (even if client-side bypassed)
    if user.role == "Observer":
        log_failed_authorization(user, incident_id, "approve_playbook")
        raise PermissionError("Insufficient permissions")

    # Permission check passed, proceed
    execute_playbook(incident_id)
Complete Audit Trail (End-to-End)
Audit log includes:
  - Who detected: SentienGuard anomaly engine
  - What anomaly: postgres_active_connections = 98%
  - What playbook: postgres_connection_reset v1.3.0
  - Who approved: bob.chen@company.com (Remediation Authority)
  - When approved: 2026-02-10T02:47:52Z
  - What executed: Complete command log with outputs
  - Outcome: Success, connection pool restored to 64%
  - MTTR: 31 seconds (detection to resolution)

Real Approval Workflow Result

2:47:44 AM        Detection
2:47:52 AM (8s)   Approval (Bob)
23 seconds        Execution
31 seconds        Total MTTR

Every Step Verified Before Proceeding

Execution without verification is hope, not confidence. Five verification types confirm desired state. Retry logic with exponential backoff handles slow startups. All checks must pass or rollback triggers.

Without Verification
Exit Code 0 ≠ Problem Solved
Step 1: Clear temp files
  - Command executed: find /tmp -mtime +7 -delete
  - Exit code: 0 (success)
  - Assumption: Disk space freed

Reality: /tmp was empty, no space freed
Result: Disk still 91%, incident not resolved
Problem: Didn't verify space actually freed
With Verification
Verify Actual State Change
Step 1: Clear temp files
  - Command executed: find /tmp -mtime +7 -delete
  - Exit code: 0 (success)
  - Verification: Check disk usage
    * Expected: <80%
    * Actual: 91%
    * Result: FAILED

Action: Move to Step 2 (try different cleanup method)
  OR: Rollback and escalate (if no more steps)

Five Verification Types

Type 1: Metric Verification

Metric Check
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"
    retry: 3
    retry_delay: 10s

Type 2: HTTP Endpoint

HTTP Health Check
verification:
  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200
    timeout: 5s
    retry: 3

Type 3: Process Verification

Process State Check
verification:
  - type: process
    name: postgresql
    state: running
    min_uptime: 10s  # Must be running for at least 10 seconds

Type 4: Custom Command

Command Exit Code
verification:
  - type: command
    command: "pg_isready -U postgres"
    expected_exit_code: 0
    timeout: 5s

Type 5: Log File Check

Error Log Scan
verification:
  - type: log_check
    file: /var/log/application.log
    pattern: "ERROR|FATAL"
    max_matches: 0  # No errors expected
    lookback: 60s   # Check last 60 seconds of logs
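A composite verifier over these check types reduces to "evaluate every check, pass only if all pass." A sketch with fake probes standing in for real metric/HTTP/log collectors (the `probes` structure is hypothetical, not the agent's API):

```python
def run_verification(checks, probes):
    """Evaluate every check; verification passes only if ALL pass.
    `probes` maps check type to a callable returning (passed, detail)."""
    results = []
    for check in checks:
        passed, detail = probes[check["type"]](check)
        # keep evaluating even after a failure so the report is complete
        results.append((check["type"], passed, detail))
    return all(p for _, p, _ in results), results

checks = [
    {"type": "metric", "metric": "disk_usage", "threshold": 80.0},
    {"type": "http", "expected_status": 200},
    {"type": "log_check", "max_matches": 0},
]
# Fake probes returning canned values for illustration
probes = {
    "metric": lambda c: (72.1 < c["threshold"], "disk_usage=72.1%"),
    "http": lambda c: (True, "200 OK"),
    "log_check": lambda c: (True, "0 new errors"),
}
ok, report = run_verification(checks, probes)
print(ok)  # True
```

Note that evaluation continues past the first failure: the report should show the status of every check, not just the first one that failed.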

Retry with Exponential Backoff

Retry Timeline
T+0s:  Execute playbook (restart service)
T+3s:  Attempt 1: Connection refused (service starting)
T+8s:  Attempt 2: 503 Service Unavailable (still starting)
T+18s: Attempt 3: 200 OK ← SUCCESS
       Verification passed, no rollback needed

Total verification time: 18 seconds
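The retry loop itself is a few lines of Python. A sketch with illustrative attempt counts and delays (not the agent's configured defaults):

```python
import time

def verify_with_backoff(check, attempts=4, base_delay=1.0):
    """Retry a health check with exponential backoff (1s, 2s, 4s, ...).
    Returns the attempt number on success, None if every attempt fails."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt        # success: skip any remaining attempts
        if attempt < attempts:
            time.sleep(delay)     # wait before the next probe
            delay *= 2            # exponential backoff
    return None                   # exhausted: trigger rollback

# Service that only becomes healthy on the third probe
state = {"probes": 0}
def health():
    state["probes"] += 1
    return state["probes"] >= 3

print(verify_with_backoff(health, attempts=4, base_delay=0.01))  # 3
```

Returning None rather than raising keeps the caller's rollback decision explicit.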

Composite Verification (Multiple Checks)

All Checks Must Pass
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"

  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200

  - type: log_check
    file: /var/log/app.log
    pattern: "ERROR"
    max_matches: 0

# ALL checks must pass for verification to succeed

Composite Execution Result

Pass/Fail per Check
Check 1: Disk usage
  Current: 72.1%
  Threshold: <80%
  Result: PASS ✓

Check 2: HTTP health endpoint
  Response: 200 OK
  Result: PASS ✓

Check 3: Error logs
  Matches: 0 errors
  Max: 0
  Result: PASS ✓

Overall: PASS (all 3 checks passed)

Automatic Revert When Health Checks Fail

If verification fails, undo changes and return to previous state. Better to abort than leave system broken. Rollback triggers on failed health checks, step execution failures, precondition violations, and manual abort.

Example: nginx_restart Playbook
name: nginx_restart
steps:
  - name: stop_nginx
    command: systemctl stop nginx
    rollback: systemctl start nginx

  - name: clear_cache
    command: rm -rf /var/cache/nginx/*
    rollback: skip  # Can't restore cache

  - name: start_nginx
    command: systemctl start nginx
    rollback: systemctl stop nginx

verification:
  - type: http
    url: http://localhost:80/health
    expected_status: 200
    retry: 3
Success Path
Verification Passes
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS
Verification: HTTP 200 → PASS

No rollback needed
Failure + Rollback
Verification Fails → Rollback
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS (exit code 0, but service crashes)
Verification: HTTP GET localhost:80/health
  Attempt 1: Connection refused → FAIL
  Attempt 2: Connection refused → FAIL
  Attempt 3: Connection refused → FAIL

Verification FAILED after 3 retries

Rollback triggered:
  Step 3 rollback: systemctl stop nginx
  Step 2 rollback: skip (can't restore cache)
  Step 1 rollback: systemctl start nginx

Result: nginx restarted (back to original state)
Status: Incident marked FAILED, escalated to human
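The reverse-order rollback with `skip` handling can be sketched end to end. The dict of lambdas stands in for real host state and commands; this models the nginx walkthrough above, not the agent's actual executor:

```python
def execute_with_rollback(steps, verify):
    """Run steps in order; on failed verification, run each executed
    step's rollback in reverse order, skipping steps marked 'skip'."""
    executed, log = [], []
    for step in steps:
        step["run"]()
        executed.append(step)
    if verify():
        return "SUCCESS", log
    for step in reversed(executed):            # rollback in reverse order
        if step.get("rollback") == "skip":
            log.append(f"skip rollback: {step['name']}")
        else:
            step["rollback"]()
            log.append(f"rolled back: {step['name']}")
    return "FAILED_ROLLED_BACK", log

# Toy state standing in for a real host; nginx starts out running
svc = {"nginx": "running", "cache": True}
steps = [
    {"name": "stop_nginx", "run": lambda: svc.update(nginx="stopped"),
     "rollback": lambda: svc.update(nginx="running")},
    {"name": "clear_cache", "run": lambda: svc.update(cache=False),
     "rollback": "skip"},                      # deleted files can't come back
    {"name": "start_nginx", "run": lambda: svc.update(nginx="running"),
     "rollback": lambda: svc.update(nginx="stopped")},
]
status, log = execute_with_rollback(steps, verify=lambda: False)  # force failure
print(status, svc["nginx"])  # FAILED_ROLLED_BACK running
```

The end state matches the walkthrough: nginx is back to running, the cache stays gone (unrecoverable), and the log records which rollbacks ran and which were skipped.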

Rollback Limitations

Deleted Files

Permanent loss, can't restore without backup.

Solution: Design playbooks to be idempotent (safe to retry).

External API Calls

Already executed, can't un-launch instances.

Solution: Check if resources exist before creating.

Database Writes

Unless transaction-wrapped, can't undo.

Solution: Use database transactions or idempotent upserts.

Sent Notifications

Can't unsend messages.

Solution: Send notifications AFTER verification passes.

Partial Rollback (Mixed Steps)
# Some steps can rollback, others can't
steps:
  - name: stop_service
    rollback: start_service    # Can rollback

  - name: delete_cache
    rollback: skip             # Can't rollback (files deleted)

  - name: update_config
    rollback: restore_old_config  # Can rollback (config backed up)

# If verification fails:
#   Rollback what's possible (restart service, restore config)
#   Escalate with partial rollback report
Rollback Timeout Enforcement
steps:
  - name: database_migration
    command: psql -f /migrations/v2.sql
    timeout: 300s  # 5 minutes
    rollback: psql -f /migrations/v2_rollback.sql
    rollback_timeout: 60s  # 1 minute (rollback should be faster)

# Why: Rollback is emergency recovery.
# If it takes too long, system stays broken.

Best Practices

Prefer Idempotency

Design playbooks safe to retry rather than relying on rollback.

Test Rollback Paths

Force verification failure in staging to confirm rollback works.

Partial Rollback

Rollback what's possible, escalate with partial report.

Rollback Timeout

Rollback should complete faster than forward execution.

SSH, Kubernetes, Cloud APIs

Three execution methods cover all infrastructure: SSH for Linux server operations, kubectl for Kubernetes, and cloud provider CLIs for AWS, GCP, and Azure resources.

SSH Commands

Linux server operations: disk cleanup, service restarts, file operations. Agent executes locally via SSH to localhost. SSH keys stored in Secrets Manager, retrieved just-in-time. Sudo allowed for specific commands only via /etc/sudoers.d/.

Kubernetes (kubectl)

Kubernetes operations: pod restarts, scaling, rollbacks. Uses ServiceAccount credentials with least-privilege RBAC. Namespace-scoped: production ServiceAccount can't touch staging.

Cloud Provider APIs

AWS CLI, gcloud, az CLI for cloud infrastructure. IAM roles (AWS), Service Principals (Azure), Service Accounts (GCP). No static API keys stored. Actions limited to playbook requirements.

Method 1: SSH Commands
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  user: sentienguard  # SSH as this user
  timeout: 60s

# Agent executes locally (SSH to localhost)
# Captures: exit code, stdout, stderr, duration
# Security: SSH keys from Secrets Manager (just-in-time)
# Sudo: allowed for specific commands only
Method 2: Kubernetes (kubectl)
- name: restart_failed_pod
  action: kubectl_delete
  resource: pod
  selector: "status=CrashLoopBackOff"
  namespace: production
  timeout: 30s

# Agent uses Kubernetes API via ServiceAccount
# What happens:
#   1. Kubernetes deletes matching pods
#   2. ReplicaSet/Deployment creates new pods
#   3. New pods start fresh (no inherited crash state)
Kubernetes RBAC Permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sentienguard-agent
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]  # Pod restarts

  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "patch"]  # Scaling, rollbacks
Method 3: AWS CLI
- name: snapshot_ebs_volume
  action: aws_cli
  command: |
    aws ec2 create-snapshot \
      --volume-id vol-1234567890abcdef \
      --description "Automated snapshot before maintenance"
  timeout: 300s
Method 3: GCP CLI
- name: scale_instance_group
  action: gcloud
  command: |
    gcloud compute instance-groups managed resize my-group \
      --size 10 \
      --zone us-central1-a
  timeout: 120s
Method 3: Azure CLI
- name: restart_vm
  action: az_cli
  command: |
    az vm restart \
      --resource-group my-rg \
      --name prod-vm-01
  timeout: 180s
Method    Use Case           Auth                Speed             Complexity
SSH       Linux server ops   SSH keys            Fast (<5s)        Low
kubectl   Kubernetes ops     ServiceAccount      Fast (<10s)       Medium
AWS CLI   AWS resources      IAM role            Medium (10-60s)   Medium
gcloud    GCP resources      Service Account     Medium (10-60s)   Medium
az        Azure resources    Service Principal   Medium (10-60s)   Medium

Common Questions

What happens if a playbook fails mid-execution?

Playbooks execute steps sequentially. If Step 3 fails, Steps 1-2 already completed. Rollback executes in reverse: undo Step 2, undo Step 1. If rollback succeeds, the system returns to its pre-playbook state. If rollback fails, escalate to human with a partial rollback report.

Can I test a playbook before enabling autonomous execution?

Yes. Three testing methods: (1) Dry-run mode (simulate execution, no actual commands run), (2) Staging environment (run on staging hosts first), (3) Manual trigger (engineer triggers the playbook manually and reviews the result before enabling autonomous execution).

Can I require human approval for a specific playbook?

Set an approval requirement: approval_gate: required: true. The playbook will always require human approval via Slack and never runs autonomously. Or disable the playbook entirely for specific hosts: exclusions: host_pattern: "*.prod.*".

How do I see exactly what a playbook will run?

Dashboard → Playbooks → disk_cleanup_prod_db → View YAML. The complete playbook definition is visible, including all commands, verification checks, and rollback steps. Audit logs show the exact commands executed historically.

Can playbooks modify themselves or create new playbooks?

No. Playbooks are immutable (version-controlled). Only Admins can create or edit playbooks via the dashboard. Playbooks cannot self-modify or spawn new playbooks, which prevents runaway automation.

What stops the same playbook from running repeatedly on one incident?

Playbooks execute only once per incident. After execution, the incident is marked "resolved" or "failed." The same incident never triggers the same playbook twice. If the problem recurs, a new incident is created (with a different incident ID).

See Automated Remediation in Action

Deploy agents, import disk_cleanup playbook, fill disk to 90%, watch autonomous execution, review health verification, examine audit log.

Test Playbook
name: disk_cleanup_test
steps:
  - name: create_large_file
    command: "dd if=/dev/zero of=/tmp/testfile bs=1M count=1000"
  - name: verify_disk_high
    command: "df -h / | awk 'NR==2 {print $5}'"
  - name: delete_large_file
    command: "rm /tmp/testfile"
  - name: verify_disk_normal
    command: "df -h / | awk 'NR==2 {print $5}'"

Free tier: 3 nodes, 50+ pre-built playbooks, full RBAC, complete audit logs, no credit card.