{
  "playbook": "disk_cleanup_prod_db",
  "version": "1.4.2",
  "incident_id": "inc_2026_02_10_1435",
  "target_host": "prod-db-03.us-east-1",
  "timestamp": "2026-02-10T14:35:43.124Z",
  "signature": "ed25519:a8f3b2c1d9e4f5a6b7c8d9e0...",
  "steps": [...]
}

Automated Remediation
Execute. Verify. Rollback. Zero Manual Intervention.
Idempotent playbooks execute via SSH, kubectl, and cloud APIs. Health verification after every step. Automatic rollback on failure. RBAC prevents runaway automation. Routine infrastructure fixes typically execute in under 60 seconds.
From Playbook Selection to Verified Resolution
Five-stage pipeline: receive signed playbook, verify preconditions, execute steps sequentially, verify health, rollback on failure. Every step logged, every action auditable.
Stage 1: Playbook Reception
<50ms
Control plane sends signed playbook to agent. Agent verifies Ed25519 cryptographic signature, checks timestamp freshness (<5 min), confirms target host matches, and validates playbook version. If any check fails: reject and log failed authorization attempt.
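The four reception checks above can be sketched as a single gate function. This is an illustrative sketch, not the agent's actual code: `verify_signature` is a stand-in callback for a real Ed25519 verifier, and the timestamp is assumed to be ISO-8601 with an explicit offset.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # freshness window from Stage 1

def accept_playbook(playbook, agent_hostname, cached_versions,
                    verify_signature, now=None):
    """Return True only if all four reception checks pass (sketch)."""
    now = now or datetime.now(timezone.utc)
    # 1. Cryptographic signature (callback stands in for Ed25519 verify)
    if not verify_signature(playbook["signature"], playbook):
        return False  # reject: bad signature
    # 2. Timestamp freshness: must be less than 5 minutes old (replay defense)
    issued = datetime.fromisoformat(playbook["timestamp"])
    if now - issued > MAX_AGE:
        return False  # reject: stale playbook
    # 3. Target host must match this agent's hostname
    if playbook["target_host"] != agent_hostname:
        return False
    # 4. Playbook version must exist in the local cache
    return playbook["version"] in cached_versions
```

Any single failed check rejects the playbook; the caller would then log the failed authorization attempt.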
Stage 2: Pre-Execution Checks
<100ms
Verify resource availability (disk, CPU, memory headroom), dependency checks (required binaries, service accounts, secrets), and conflict detection (no other playbook running, no maintenance window active, host not marked do-not-remediate). If preconditions fail: abort, report, escalate.
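A minimal sketch of these precondition checks. The lock-file path, do-not-remediate flag path, and 1 GiB headroom threshold are illustrative assumptions, not product defaults.

```python
import os
import shutil

def check_preconditions(required_binaries,
                        lock_path="/var/run/sentienguard.lock",
                        min_free_bytes=1 << 30,
                        do_not_remediate_flag="/etc/do-not-remediate"):
    """Stage 2 sketch: return a list of failures; empty list means proceed."""
    failures = []
    # Resource availability: enough free disk for logs/temp files
    if shutil.disk_usage("/").free < min_free_bytes:
        failures.append("insufficient disk headroom")
    # Dependency checks: required binaries present on PATH
    for binary in required_binaries:
        if shutil.which(binary) is None:
            failures.append(f"missing binary: {binary}")
    # Conflict detection: another playbook running, or host opted out
    if os.path.exists(lock_path):
        failures.append("another playbook is executing")
    if os.path.exists(do_not_remediate_flag):
        failures.append("host marked do-not-remediate")
    return failures
```

A non-empty return means abort, report, and escalate rather than execute.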
Stage 3: Step-by-Step Execution
10-90s
Execute steps sequentially (not parallel). Each step captures stdout, stderr, exit code. Step 2 depends on Step 1 completing. Health verification after each step. Timeout enforcement per step (default 60s). If step exceeds timeout: kill, mark failed, trigger rollback.
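Sequential execution with per-step timeouts and output capture might look like the following sketch (the step schema is illustrative, and a POSIX shell is assumed):

```python
import subprocess

def run_steps(steps, default_timeout=60):
    """Stage 3 sketch: run steps one at a time, capture stdout/stderr/exit
    code, enforce a per-step timeout, and stop at the first failure."""
    results = []
    for step in steps:
        try:
            proc = subprocess.run(
                ["sh", "-c", step["command"]],
                capture_output=True, text=True,
                timeout=step.get("timeout", default_timeout),
            )
            results.append({"name": step["name"],
                            "exit_code": proc.returncode,
                            "stdout": proc.stdout,
                            "stderr": proc.stderr})
            if proc.returncode != 0:
                break  # failed step: stop here, caller triggers rollback
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child when the timeout expires
            results.append({"name": step["name"],
                            "exit_code": None, "timed_out": True})
            break
    return results
```

A timed-out or failing step halts the sequence so later steps never run against an unknown state.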
Stage 4: Health Verification
10-30s
Verify desired state achieved: metric thresholds met, HTTP endpoints healthy, no new errors in logs, performance within bounds. Retry logic with exponential backoff (3 attempts). All checks must pass. If any check fails after retries: trigger rollback.
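The retry loop can be sketched as below; the delays and attempt count are parameters of this sketch, not the product's actual schedule.

```python
import time

def verify_with_backoff(check, attempts=3, base_delay=1.0, factor=2.0,
                        sleep=time.sleep):
    """Stage 4 sketch: retry a health check with exponential backoff.
    `check` is a callable returning True once the desired state is reached."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return True  # verification passed
        if attempt < attempts:
            sleep(delay)      # wait: services may still be starting up
            delay *= factor   # exponential backoff: 1s, 2s, 4s, ...
    return False  # retries exhausted: caller triggers rollback
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real waiting.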
Stage 5: Rollback on Failure
10-60s
If verification fails: execute rollback steps in reverse order, verify rollback successful, report failure to control plane, escalate to human. Operations that can't roll back (deleted files, external API calls) are skipped. Best practice: design for idempotency over rollback.
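Reverse-order rollback with skip support, as a hedged sketch: the `run` and `verify` callbacks and the step schema are assumptions for illustration.

```python
def execute_with_rollback(steps, run, verify):
    """Stage 5 sketch: run forward steps; on any failure, roll back the
    completed steps in reverse order, skipping steps marked rollback='skip'."""
    completed = []
    for step in steps:
        if not run(step["command"]):
            break  # forward step failed: fall through to rollback
        completed.append(step)
    else:
        if verify():
            return "SUCCESS"
    # Failure path: undo in reverse; irreversible steps are skipped
    for step in reversed(completed):
        if step.get("rollback", "skip") != "skip":
            run(step["rollback"])
    return "FAILED_ROLLED_BACK"  # report to control plane, escalate to human
```

Note the last completed step is undone first, mirroring the "reverse order" rule above.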
1. Verify cryptographic signature (Ed25519 public key)
2. Check timestamp freshness (must be <5 minutes old)
3. Verify target host matches agent's hostname
4. Confirm playbook version exists in cache
If ANY check fails: Reject, log failed authorization attempt
If all checks pass: Proceed to execution

Before executing ANY step, verify preconditions:
1. Resource availability:
   - Disk space sufficient for logs/temp files
   - CPU headroom available (not already at 100%)
   - Memory available for execution
2. Dependency checks:
   - Required binaries present (kubectl, aws, gcloud)
   - Service accounts configured
   - Secrets accessible (SSH keys, API tokens)
3. Conflict detection:
   - No other playbook currently executing (serialized)
   - No maintenance windows active
   - Host not marked as "do not remediate"
If preconditions fail: Abort, report, escalate to human
If preconditions pass: Begin execution

# Each step has a timeout
steps:
  - name: clear_temp_files
    timeout: 60s  # Max 60 seconds
  - name: rotate_logs
    timeout: 60s
# If a step exceeds its timeout: KILL, mark failed, trigger rollback

Execute playbook steps sequentially (not parallel):
Step 1: Clear temp files
Command: find /tmp -type f -mtime +7 -delete
Execution: SSH to localhost, run command
Capture: stdout, stderr, exit code
Duration: 3.8 seconds
Result: 100 files deleted, 8.3 GB freed
Step 2: Rotate logs
Command: logrotate -f /etc/logrotate.conf
Execution: SSH to localhost, run command
Capture: stdout, stderr, exit code
Duration: 1.9 seconds
Result: 12 log files rotated, 3.1 GB freed
Step 3: Verify space freed
Command: df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
Capture: stdout = "72" (72% disk usage)
Duration: 0.2 seconds
Result: Disk usage 72% (down from 91%)

After execution completes, verify desired state:
Verification checks:
1. Metric check: disk_usage < 80%
2. Service health: HTTP 200 on health endpoint
3. Error log check: No new errors in /var/log/syslog
4. Performance check: Disk I/O latency <20ms
Health verification result:
- Disk usage: 72.1% ✓ (< 80% threshold)
- HTTP health: 200 OK ✓
- Error logs: 0 new errors ✓
- Disk latency: 3ms ✓ (<20ms threshold)
Overall: PASS (all checks passed)

Verification failed: Disk usage 89% (expected <80%)
Rollback triggered:
1. Identify rollback steps from playbook
2. Execute rollback steps in reverse order
3. Verify rollback successful
4. Report failure to control plane
5. Escalate to human
Problem: Disk cleanup is one-way (can't restore deleted files)
Action: Escalate to human, incident marked FAILED

Timing breakdown:
Reception: 50ms
Precondition checks: 100ms
Execution: 87s (example: disk cleanup)
Health verification: 20s (3 retries)
Rollback: 0s (verification passed, no rollback needed)
Total: 107 seconds (this example)

Running a Playbook Twice = Same Result as Running Once
Idempotent operations produce the same result regardless of how many times executed. Network failures, timeouts, retries must not cause duplicate actions or side effects. Every SentienGuard playbook enforces idempotency through state checks, conditional execution, and health verification.
Idempotent function f(x):
f(x) = result
f(f(x)) = result (applying twice = same as applying once)
f(f(f(x))) = result (applying N times = same result)
Non-idempotent function g(x):
g(x) = result_1
g(g(x)) = result_2 (different result!)
g(g(g(x))) = result_3 (keeps changing)

First Execution:
Files found: 1,247 files (8.3 GB total)
Files deleted: 1,247 files
Space freed: 8.3 GB
Disk usage: 91% → 72%
Second Execution (Immediate Retry):
Files found: 0 files (already deleted)
Files deleted: 0 files
Space freed: 0 GB
Disk usage: 72% → 72% (no change)
Third Execution:
Files found: 0 files
Disk usage: 72% → 72% (still no change)
Result: Idempotent. Running 1 time or 100 times = same outcome.

Why Idempotency Matters
Agent executes Step 2 of 4: Clear temp files
Network connection lost (timeout)
Control plane doesn't receive completion ACK
Question: Did Step 2 complete?
- Maybe yes (completed, but ACK lost in network)
- Maybe no (failed mid-execution)
Safe action: Re-run entire playbook from Step 1
If idempotent:
Step 1 (re-run): No-op (already done)
Step 2 (re-run): No-op (already done)
Step 3: Executes normally
Step 4: Executes normally
Result: Success, no duplicate actions
If NOT idempotent:
Step 1 (re-run): Duplicate action (BAD)
Step 2 (re-run): Duplicate action (BAD)
Result: Side effects, unintended consequences

Engineer reviews incident dashboard:
"Did disk_cleanup_prod_db run? I don't see confirmation."
Engineer clicks: "Re-run playbook manually"
If idempotent:
Playbook checks disk usage: 72% (already fixed)
Conditional: disk_usage > 85% = FALSE
No action taken
Result: Safe, no harm done
If NOT idempotent:
Playbook runs again, deletes more files
Result: Accidental damage from double-execution

How We Enforce Idempotency
Mechanism 1: State Checks Before Execution
# Playbook includes conditional checks
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  only_if: "disk_usage > 85%"  # Skip if already below threshold

Mechanism 3: Health Verification as Exit Condition
def execute_playbook_idempotent(playbook):
    # Check if already in desired state
    if verify_health(playbook.verification):
        log("Already in desired state, skipping execution")
        return "SUCCESS_NO_ACTION"
    # Not in desired state, execute playbook
    execute_steps(playbook.steps)
    # Verify we reached desired state
    if verify_health(playbook.verification):
        return "SUCCESS"
    else:
        return "FAILED"

# First run:  disk 91% → execute → disk 72% → SUCCESS
# Second run: disk 72% → already healthy → SUCCESS_NO_ACTION

Mechanism 2: Evaluation Logic
def execute_step(step):
    # Check condition before running
    if step.only_if:
        condition_met = evaluate_condition(step.only_if)
        if not condition_met:
            log("Skipping step: condition not met")
            return "SKIPPED"
    # Condition met (or no condition), execute
    result = run_command(step.command)
    return result

# First run:
#   Condition: disk_usage > 85%, Current: 91%
#   91 > 85 = TRUE → Execute cleanup
# Second run (retry):
#   Condition: disk_usage > 85%, Current: 72%
#   72 > 85 = FALSE → Skip (already done)

Mechanism 4: Audit Trail Marks Retries
{
  "incident_id": "inc_2026_02_10_1435",
  "playbook": "disk_cleanup_prod_db",
  "executions": [
    {
      "execution_id": "exec_001",
      "timestamp": "2026-02-10T14:35:43Z",
      "result": "success",
      "duration": 87
    },
    {
      "execution_id": "exec_002",
      "timestamp": "2026-02-10T14:38:15Z",
      "result": "success_no_action",
      "duration": 2,
      "reason": "retry_after_timeout"
    }
  ]
}

Non-Idempotent Anti-Patterns

Incrementing Counters
echo $(( $(cat /var/retry_count) + 1 )) > /var/retry_count
counter = 1, then 2, then 3. Side effects accumulate.
Appending to Files
echo 'Incident resolved' >> /var/log/incidents.log
Duplicate log entries on each run.
Sending Notifications
mail -s 'Disk cleaned' ops@company.com
Duplicate emails on each retry.
DB Inserts Without Check
psql -c "INSERT INTO incidents (id, status) VALUES (123, 'resolved')"
Duplicate key error or duplicate rows.

Idempotent Patterns
Delete Files Matching Criteria
find /tmp -type f -mtime +7 -delete
Run 1: deletes 100 files. Run 2: deletes 0 (already gone). Same outcome.
Set State (Not Increment)
setquota -u postgres 100G 110G 0 0 /
Run 1: sets quota. Run 2: quota already set. No change.
Conditional Restarts
systemctl restart nginx  # only_if: service_status != active
Run 1: restarts (was failed). Run 2: skips (already active).
Upsert (Update or Insert)
INSERT INTO incidents ... ON CONFLICT (id) DO UPDATE SET status = 'resolved'
Run 1: inserts row. Run 2: updates same row. Single row, correct state.
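The upsert pattern can be demonstrated end-to-end with SQLite (illustrative schema; `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer):

```python
import sqlite3

# Demonstrate that an upsert is idempotent: running it twice leaves one row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, status TEXT)")

def resolve_incident(incident_id):
    # ON CONFLICT turns a duplicate insert into an update of the same row
    conn.execute(
        "INSERT INTO incidents (id, status) VALUES (?, 'resolved') "
        "ON CONFLICT (id) DO UPDATE SET status = 'resolved'",
        (incident_id,),
    )

resolve_incident(123)
resolve_incident(123)  # retry: updates the same row instead of erroring
rows = conn.execute("SELECT id, status FROM incidents").fetchall()
print(rows)  # [(123, 'resolved')]
```

A plain INSERT would have raised a duplicate-key error on the retry; the upsert converges to the same single row.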
Three Roles Prevent Runaway Automation
Autonomous systems without access controls are dangerous: junior engineers accidentally triggering production playbooks, compromised accounts executing malicious actions, bugs causing cascading failures. RBAC enforces who can see, who can approve, and who can configure.
Observer
Read-Only Access
Allowed
- View incidents, playbooks, audit logs, dashboards
- Search historical incidents
- Export reports (audit logs, compliance evidence)
- View playbook execution results
Denied
- Cannot approve playbook execution
- Cannot trigger playbooks manually
- Cannot modify playbooks or system settings
- Cannot manage users or roles
Use cases: Junior engineers, auditors, executives, external consultants, interns, contractors.
Remediation Authority
Approve & Execute
Allowed
- All Observer permissions
- Approve playbook execution via Slack approval gates
- Manually trigger playbooks from dashboard
- Override autonomous decisions (stop execution mid-flight)
- Acknowledge and close incidents
Denied
- Cannot create, edit, or delete playbooks
- Cannot manage users or assign roles
- Cannot modify system settings (thresholds, integrations)
Use cases: Senior SREs on-call, DevOps engineers managing production, infrastructure team leads.
Administrator
Full Control
Allowed
- All Remediation Authority permissions
- Create, edit, delete playbooks
- Manage users (add, remove, assign roles)
- Configure integrations (Slack, AWS, GCP, Azure)
- Modify system settings (thresholds, environments, anomaly detection)
- Access billing and subscription management
Use cases: DevOps team leads, platform engineering managers, CTO, system administrators.
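One way to model the three roles above is a simple permission table checked server-side. The action names here are illustrative, not SentienGuard's actual permission strings.

```python
# Hypothetical permission table distilled from the three roles above.
ROLE_PERMISSIONS = {
    "observer": {"view_incidents", "export_reports"},
    "remediation_authority": {"view_incidents", "export_reports",
                              "approve_playbook", "trigger_playbook"},
    "administrator": {"view_incidents", "export_reports", "approve_playbook",
                      "trigger_playbook", "edit_playbook", "manage_users"},
}

def is_allowed(role, action):
    """Server-side check: unknown roles get no permissions (fail closed)."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Each role is a strict superset of the one below it, matching the "all Observer permissions" / "all Remediation Authority permissions" wording above.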
Environment-Specific RBAC
users:
  - email: alice.jones@company.com
    roles:
      production: observer       # Read-only in prod
      staging: administrator     # Full control in staging
      dev: administrator         # Full control in dev
  - email: bob.chen@company.com
    roles:
      production: remediation_authority  # Can approve prod fixes
      staging: remediation_authority
      dev: observer              # Not involved in dev
  - email: charlie.wang@company.com
    roles:
      production: observer       # External auditor
      staging: observer          # Read-only everywhere
      dev: observer

Slack Approval Gate
[SentienGuard] Approval Required
Incident: prod-db-03 connection pool exhausted (98%)
Playbook: postgres_connection_reset (confidence 0.94)
This will:
1. Kill idle connections older than 1 hour (estimated: 23)
2. Reset connection pool to default limits
3. Verify new connections successful
Estimated duration: 23 seconds
Risk: May interrupt long-running analytical queries
[Approve] [Deny] [View Details]

{
  "approved_by": "bob.chen@company.com",
  "role": "Remediation Authority",
  "timestamp": "2026-02-10T02:47:52.187Z",
  "ip_address": "203.0.113.42",
  "approval_latency_seconds": 8,
  "incident_id": "inc_2026_02_10_0247",
  "playbook": "postgres_connection_reset",
  "confidence": 0.94
}

{
  "timestamp": "2026-02-10T02:48:00Z",
  "user": "alice.jones@company.com",
  "role": "Observer",
  "action": "approve_playbook",
  "incident_id": "inc_2026_02_10_0247",
  "result": "denied_insufficient_permissions",
  "ip_address": "203.0.113.42"
}

def approve_playbook(user, incident_id):
    # Server-side enforcement (even if client-side bypassed)
    if user.role == "Observer":
        log_failed_authorization(user, incident_id, "approve_playbook")
        raise PermissionError("Insufficient permissions")
    # Permission check passed, proceed
    execute_playbook(incident_id)

Audit log includes:
- Who detected: SentienGuard anomaly engine
- What anomaly: postgres_active_connections = 98%
- What playbook: postgres_connection_reset v1.3.0
- Who approved: bob.chen@company.com (Remediation Authority)
- When approved: 2026-02-10T02:47:52Z
- What executed: Complete command log with outputs
- Outcome: Success, connection pool restored to 64%
- MTTR: 31 seconds (detection to resolution)

Real Approval Workflow Result
Detection: 2:47:44 AM
Approval (Bob): 2:47:52 AM (8s)
Execution: 23 seconds
Total MTTR: 31 seconds
Every Step Verified Before Proceeding
Execution without verification is hope, not confidence. Five verification types confirm desired state. Retry logic with exponential backoff handles slow startups. All checks must pass or rollback triggers.
Without verification:
Step 1: Clear temp files
- Command executed: find /tmp -mtime +7 -delete
- Exit code: 0 (success)
- Assumption: Disk space freed
Reality: /tmp was empty, no space freed
Result: Disk still 91%, incident not resolved
Problem: Didn't verify space actually freed

With verification:
Step 1: Clear temp files
- Command executed: find /tmp -mtime +7 -delete
- Exit code: 0 (success)
- Verification: Check disk usage
* Expected: <80%
* Actual: 91%
* Result: FAILED
Action: Move to Step 2 (try different cleanup method)
OR: Rollback and escalate (if no more steps)

Five Verification Types
Type 1: Metric Verification
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"
    retry: 3
    retry_delay: 10s

Type 2: HTTP Endpoint
verification:
  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200
    timeout: 5s
    retry: 3

Type 3: Process Verification
verification:
  - type: process
    name: postgresql
    state: running
    min_uptime: 10s  # Must be running for at least 10 seconds

Type 4: Custom Command
verification:
  - type: command
    command: "pg_isready -U postgres"
    expected_exit_code: 0
    timeout: 5s

Type 5: Log File Check
verification:
  - type: log_check
    file: /var/log/application.log
    pattern: "ERROR|FATAL"
    max_matches: 0  # No errors expected
    lookback: 60s   # Check last 60 seconds of logs

Retry with Exponential Backoff
T+0s: Execute playbook (restart service)
T+3s: Attempt 1: Connection refused (service starting)
T+8s: Attempt 2: 503 Service Unavailable (still starting)
T+18s: Attempt 3: 503 Service Unavailable
T+33s: Attempt 4: 200 OK ← SUCCESS
Skip attempt 5, verification passed
Total verification time: 33 seconds

Composite Verification (Multiple Checks)
verification:
  - type: metric
    metric: disk_usage
    threshold: "< 80%"
  - type: http
    url: "http://localhost:8080/health"
    expected_status: 200
  - type: log_check
    file: /var/log/app.log
    pattern: "ERROR"
    max_matches: 0
# ALL checks must pass for verification to succeed

Composite Execution Result
Check 1: Disk usage
Current: 72.1%
Threshold: <80%
Result: PASS ✓
Check 2: HTTP health endpoint
Response: 200 OK
Result: PASS ✓
Check 3: Error logs
Matches: 0 errors
Max: 0
Result: PASS ✓
Overall: PASS (all 3 checks passed)

Automatic Revert When Health Checks Fail
If verification fails, undo changes and return to previous state. Better to abort than leave system broken. Rollback triggers on failed health checks, step execution failures, precondition violations, and manual abort.
name: nginx_restart
steps:
  - name: stop_nginx
    command: systemctl stop nginx
    rollback: systemctl start nginx
  - name: clear_cache
    command: rm -rf /var/cache/nginx/*
    rollback: skip  # Can't restore cache
  - name: start_nginx
    command: systemctl start nginx
    rollback: systemctl stop nginx
verification:
  - type: http
    url: http://localhost:80/health
    expected_status: 200
    retry: 3

Successful execution:
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS
Verification: HTTP 200 → PASS
No rollback needed

Failed execution:
Step 1: Stop nginx → SUCCESS
Step 2: Clear cache → SUCCESS
Step 3: Start nginx → SUCCESS (exit code 0, but service crashes)
Verification: HTTP GET localhost:80/health
Attempt 1: Connection refused → FAIL
Attempt 2: Connection refused → FAIL
Attempt 3: Connection refused → FAIL
Verification FAILED after 3 retries
Rollback triggered:
Step 3 rollback: systemctl stop nginx
Step 2 rollback: skip (can't restore cache)
Step 1 rollback: systemctl start nginx
Result: nginx restarted (back to original state)
Status: Incident marked FAILED, escalated to human

Rollback Limitations
Deleted Files
Permanent loss, can't restore without backup.
Solution: Design playbooks to be idempotent (safe to retry).
External API Calls
Already executed, can't un-launch instances.
Solution: Check if resources exist before creating.
Database Writes
Unless transaction-wrapped, can't undo.
Solution: Use database transactions or idempotent upserts.
Sent Notifications
Can't unsend messages.
Solution: Send notifications AFTER verification passes.
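The transaction-wrapped write pattern from the "Database Writes" point above can be sketched with SQLite; the table and health check are illustrative, not the product's schema.

```python
import sqlite3

# Sketch: wrap a database write in a transaction so a failed verification
# rolls it back instead of leaving partial state behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()

def apply_change(verified):
    try:
        with conn:  # opens a transaction; rolls back if the block raises
            conn.execute("INSERT INTO config VALUES ('max_conns', '200')")
            if not verified():
                raise RuntimeError("health check failed")
    except RuntimeError:
        pass  # write was rolled back; escalate instead of partial state

apply_change(lambda: False)
print(conn.execute("SELECT COUNT(*) FROM config").fetchone()[0])  # 0: rolled back
```

The `with conn:` context manager commits on success and rolls back on an exception, so the failed verification leaves zero rows.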
# Some steps can rollback, others can't
steps:
  - name: stop_service
    rollback: start_service       # Can rollback
  - name: delete_cache
    rollback: skip                # Can't rollback (files deleted)
  - name: update_config
    rollback: restore_old_config  # Can rollback (config backed up)
# If verification fails:
#   Rollback what's possible (restart service, restore config)
#   Escalate with partial rollback report

steps:
  - name: database_migration
    command: psql -f /migrations/v2.sql
    timeout: 300s          # 5 minutes
    rollback: psql -f /migrations/v2_rollback.sql
    rollback_timeout: 60s  # 1 minute (rollback should be faster)
# Why: Rollback is emergency recovery.
# If it takes too long, system stays broken.

Best Practices
Prefer Idempotency
Design playbooks safe to retry rather than relying on rollback.
Test Rollback Paths
Force verification failure in staging to confirm rollback works.
Partial Rollback
Rollback what's possible, escalate with partial report.
Rollback Timeout
Rollback should complete faster than forward execution.
SSH, Kubernetes, Cloud APIs
Three execution methods cover all infrastructure: SSH for Linux server operations, kubectl for Kubernetes, and cloud provider CLIs for AWS, GCP, and Azure resources.
SSH Commands
Linux server operations: disk cleanup, service restarts, file operations. Agent executes locally via SSH to localhost. SSH keys stored in Secrets Manager, retrieved just-in-time. Sudo allowed for specific commands only via /etc/sudoers.d/.
Kubernetes (kubectl)
Kubernetes operations: pod restarts, scaling, rollbacks. Uses ServiceAccount credentials with least-privilege RBAC. Namespace-scoped: production ServiceAccount can't touch staging.
Cloud Provider APIs
AWS CLI, gcloud, az CLI for cloud infrastructure. IAM roles (AWS), Service Principals (Azure), Service Accounts (GCP). No static API keys stored. Actions limited to playbook requirements.
- name: clear_temp_files
  action: ssh_command
  command: "find /tmp -type f -mtime +7 -delete"
  user: sentienguard  # SSH as this user
  timeout: 60s
# Agent executes locally (SSH to localhost)
# Captures: exit code, stdout, stderr, duration
# Security: SSH keys from Secrets Manager (just-in-time)
# Sudo: allowed for specific commands only

- name: restart_failed_pod
  action: kubectl_delete
  resource: pod
  selector: "status=CrashLoopBackOff"
  namespace: production
  timeout: 30s
# Agent uses Kubernetes API via ServiceAccount
# What happens:
# 1. Kubernetes deletes matching pods
# 2. ReplicaSet/Deployment creates new pods
# 3. New pods start fresh (no inherited crash state)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sentienguard-agent
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]  # Pod restarts
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "patch"]           # Scaling, rollbacks

- name: snapshot_ebs_volume
  action: aws_cli
  command: |
    aws ec2 create-snapshot \
      --volume-id vol-1234567890abcdef \
      --description "Automated snapshot before maintenance"
  timeout: 300s

- name: scale_instance_group
  action: gcloud
  command: |
    gcloud compute instance-groups managed resize my-group \
      --size 10 \
      --zone us-central1-a
  timeout: 120s

- name: restart_vm
  action: az_cli
  command: |
    az vm restart \
      --resource-group my-rg \
      --name prod-vm-01
  timeout: 180s

| Method | Use Case | Auth | Speed | Complexity |
|---|---|---|---|---|
| SSH | Linux server ops | SSH keys | Fast (<5s) | Low |
| kubectl | Kubernetes ops | ServiceAccount | Fast (<10s) | Medium |
| AWS CLI | AWS resources | IAM role | Medium (10-60s) | Medium |
| gcloud | GCP resources | Service Account | Medium (10-60s) | Medium |
| az | Azure resources | Service Principal | Medium (10-60s) | Medium |
Common Questions
What happens when a step fails mid-playbook?
Playbooks execute steps sequentially. If Step 3 fails, Steps 1-2 already completed. Rollback executes in reverse: undo Step 2, undo Step 1. If rollback succeeds, the system returns to its pre-playbook state. If rollback fails, escalate to a human with a partial rollback report.
Can I test playbooks before enabling autonomous execution?
Yes. Three testing methods: (1) Dry-run mode (simulate execution; no actual commands run), (2) Staging environment (run on staging hosts first), (3) Manual trigger (an engineer triggers the playbook manually and reviews the result before enabling autonomous execution).
How do I prevent a playbook from running autonomously?
Set an approval requirement: approval_gate: required: true. The playbook will then always require human approval via Slack and never runs autonomously. Or disable the playbook entirely for specific hosts: exclusions: host_pattern: "*.prod.*".
How do I see exactly what a playbook will do?
Dashboard → Playbooks → disk_cleanup_prod_db → View YAML. The complete playbook definition is visible, including all commands, verification checks, and rollback steps. Audit logs show the exact commands executed historically.
Can playbooks modify themselves?
No. Playbooks are immutable (version-controlled). Only Administrators can create or edit playbooks via the dashboard. Playbooks cannot self-modify or spawn new playbooks, which prevents runaway automation.
Can the same playbook loop on a single incident?
No. Playbooks execute only once per incident. After execution, the incident is marked "resolved" or "failed." The same incident never triggers the same playbook twice. If the problem recurs, a new incident is created (with a different incident ID).
See Automated Remediation in Action
Deploy agents, import disk_cleanup playbook, fill disk to 90%, watch autonomous execution, review health verification, examine audit log.
name: disk_cleanup_test
steps:
  - name: create_large_file
    command: "dd if=/dev/zero of=/tmp/testfile bs=1M count=1000"
  - name: verify_disk_high
    command: "df -h / | awk 'NR==2 {print $5}'"
  - name: delete_large_file
    command: "rm /tmp/testfile"
  - name: verify_disk_normal
    command: "df -h / | awk 'NR==2 {print $5}'"
Free tier: 3 nodes, 50+ pre-built playbooks, full RBAC, complete audit logs, no credit card.