SentienGuard

THE SELF-HEALING ENGINE

Deterministic Automation.
Not Magic AI.

The Self-Healing Engine detects incidents, selects remediation playbooks, executes fixes autonomously, verifies outcomes, and logs everything immutably. Here is exactly how it works.

<90s

Mean Time to Resolution

From detection to verified fix

87%

Autonomous Resolution Rate

No human intervention required

100%

Actions Logged

Immutable audit trail

How Autonomous Healing Works

Five deterministic and auditable stages from anomaly detection to verified resolution.

Stage 1

Anomaly Detection

Lightweight agents monitor infrastructure with eBPF and OpenTelemetry while dynamic baselines detect meaningful deviations.

  • Data sources: eBPF, OTel, cloud APIs, Kubernetes API
  • 7-day rolling baselines with time-of-day context
  • Trigger threshold: >2 standard deviations
Anomaly Detection Example
Metric: disk_usage
Current: 91.4%
Baseline (7d avg): 68.2%
Deviation: +34%
Status: CRITICAL
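
The baseline comparison above can be sketched in Python. This is a minimal illustration under stated assumptions, not SentienGuard's implementation: the function name `is_anomalous`, the sample history, and the use of a plain mean and sample standard deviation are all hypothetical. Only the 2-standard-deviation trigger and the 68.2% / 91.4% figures come from the text.

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold_sigmas=2.0):
    """Compare `current` with a baseline window; return (flag, z_score)."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 points
    if spread == 0:
        # Flat baseline: treat any change as a deviation.
        changed = current != baseline
        return changed, (float("inf") if changed else 0.0)
    z = (current - baseline) / spread
    return abs(z) > threshold_sigmas, z

# Hypothetical disk_usage samples (%), one per day, same time-of-day slot.
# They average 68.2 to match the example above.
history = [66.0, 67.5, 68.0, 68.2, 68.9, 69.3, 69.5]
flag, z = is_anomalous(history, 91.4)
print(f"anomalous={flag} z={z:.1f}")
```

Keeping one window per time-of-day slot is one simple way to get the "time-of-day context" mentioned above: a nightly backup spike is then compared with other nights, not with daytime load.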

eBPF | OpenTelemetry | Prometheus-compatible

Next: Incident data + context

Stage 2

AI Decision Engine

RAG-assisted selection matches incidents to proven playbooks and assigns a confidence score to each match. AI helps choose; playbooks execute.

  • Vector search returns top candidate playbooks
  • Context matching by environment and host profile
  • Confidence thresholds drive autonomous, approval, or escalation paths
AI Decision Engine Example
Incident: disk_usage > 85% on prod-db-03
1) disk_cleanup_prod_db  confidence: 0.94
2) disk_expansion_ebs    confidence: 0.72
3) log_rotation_generic  confidence: 0.68
Selected: disk_cleanup_prod_db
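
How confidence might drive the autonomous, approval, and escalation paths can be sketched as follows. The cut-off values (0.90 and 0.70), the `Candidate` type, and the `route` function are assumptions for illustration; the page does not publish the actual thresholds.

```python
from dataclasses import dataclass

# Assumed cut-offs: the page does not publish the real thresholds.
AUTONOMOUS_MIN = 0.90
APPROVAL_MIN = 0.70

@dataclass
class Candidate:
    playbook: str
    confidence: float

def route(candidates, environment, gated_envs=("production",)):
    """Pick the top candidate and decide how it may run."""
    best = max(candidates, key=lambda c: c.confidence)
    if best.confidence < APPROVAL_MIN:
        return best.playbook, "escalate"      # too uncertain: hand to a human
    if best.confidence < AUTONOMOUS_MIN or environment in gated_envs:
        return best.playbook, "approval"      # run only after sign-off
    return best.playbook, "autonomous"

candidates = [
    Candidate("disk_cleanup_prod_db", 0.94),
    Candidate("disk_expansion_ebs", 0.72),
    Candidate("log_rotation_generic", 0.68),
]
print(route(candidates, "production"))
```

In this sketch a high-confidence match on a gated environment still goes through approval, which mirrors the incident walkthrough later on this page (confidence 0.94, approval required).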

LangChain | OpenAI Embeddings | Pinecone/Weaviate

Next: Selected playbook + confidence

Stage 3

RBAC + Approval Gate

Production fixes can require explicit approval from authorized roles before execution begins.

  • Slack approval webhook to incident channel
  • Role checks for Remediation Authority
  • Configurable timeout and fallback behavior
RBAC + Approval Gate Example
IF environment == "production":
  send_slack_approval()
  wait_for_authorized_approver(timeout=5m)
ELSE:
  execute_now()
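
The gate logic above, written as a runnable Python sketch. The function `approval_gate`, its callback parameters, and the three return values are hypothetical; a real deployment would wire `request_approval` and `poll_decision` to Slack, which is not shown here.

```python
import time

def approval_gate(environment, request_approval, poll_decision,
                  timeout_s=300, poll_interval_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Return "execute", "rejected", or "timeout"."""
    if environment != "production":
        return "execute"
    request_approval()
    deadline = clock() + timeout_s
    while clock() < deadline:
        decision = poll_decision()  # None until someone responds
        if decision is not None:
            approver, role, approved = decision
            if role == "Remediation Authority":
                return "execute" if approved else "rejected"
            # Responses from other roles are ignored; keep waiting.
        sleep(poll_interval_s)
    return "timeout"

decision = approval_gate(
    "production",
    request_approval=lambda: print("approval requested"),
    poll_decision=lambda: ("john.chen@company.com", "Remediation Authority", True),
    sleep=lambda s: None,
)
print(decision)
```

What happens on "timeout" (escalate, retry the request, or abort) is the configurable fallback behavior listed above and is left to the caller in this sketch.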

Slack Webhooks | RBAC | Timeout Logic

Next: Approved execution

Stage 4

Playbook Execution

Commands execute sequentially with retries, timeouts, and complete stdout/stderr capture.

  • Supports SSH, Kubernetes API, cloud SDKs
  • Per-step timeout and exponential backoff
  • Structured logs with millisecond timestamps
Playbook Execution Example
{
  "step": "clear_temp_files",
  "command": "find /tmp -type f -mtime +7 ! -path '/tmp/.trash/*' -exec mv {} /tmp/.trash/ \\;",
  "duration_ms": 3767,
  "exit_code": 0,
  "stdout": "Moved 1,247 files (8.3 GB freed)"
}
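
A minimal sketch of a step runner with a per-step timeout, exponential backoff, and full output capture. It runs commands locally with Python's `subprocess` as a stand-in for the SSH, Kubernetes, and cloud SDK transports named above; `run_step`, its parameters, and the record fields beyond those in the example are assumptions.

```python
import subprocess
import sys
import time

def run_step(name, argv, timeout_s=30, max_attempts=3, base_delay_s=1.0):
    """Run one step, capture output, retry failures with exponential backoff."""
    record = {}
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            proc = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout_s)
            exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            exit_code, stdout, stderr = None, "", f"timed out after {timeout_s}s"
        record = {
            "step": name,
            "attempt": attempt,
            "duration_ms": int((time.monotonic() - started) * 1000),
            "exit_code": exit_code,
            "stdout": stdout,
            "stderr": stderr,
        }
        if exit_code == 0:
            break
        if attempt < max_attempts:
            # Delay doubles each time: base, 2x base, 4x base, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    return record

result = run_step("say_ok", [sys.executable, "-c", "print('ok')"])
print(result["exit_code"], result["stdout"].strip())
```

Passing the command as an argument list rather than a shell string avoids quoting problems; a step that needs shell features would have to invoke a shell explicitly.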

node-ssh | Kubernetes API | AWS/GCP/Azure SDKs

Next: Execution output + logs

Stage 5

Health Verification

Post-remediation checks confirm the outcome. Failure triggers retries, rollback, or human escalation.

  • Metric checks against safe thresholds
  • HTTP and readiness probe validation
  • Rollback workflows for failed verification
Health Verification Example
Wait: 10s
Recheck: disk_usage
Expected: < 80%
Actual: 72.1%
Result: PASS
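
The check above, plus the failure paths, as a short Python sketch. The names `verify` and `next_action` are hypothetical, and the ordering on failure (retry first, then rollback, then escalation) is an assumption; the text lists the three outcomes without ranking them.

```python
import time

def verify(read_metric, threshold=80.0, wait_s=10, sleep=time.sleep):
    """Wait, re-read the metric, and compare it with the safe threshold."""
    sleep(wait_s)
    actual = read_metric()
    return {"actual": actual, "result": "PASS" if actual < threshold else "FAIL"}

def next_action(result, retries_left, rollback_available):
    """Assumed ordering on failure: retry, then roll back, then escalate."""
    if result["result"] == "PASS":
        return "resolve"
    if retries_left > 0:
        return "retry"
    return "rollback" if rollback_available else "escalate"

check = verify(lambda: 72.1, sleep=lambda s: None)  # skip the real wait here
print(check["result"], next_action(check, retries_left=0, rollback_available=True))
```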

HTTP Clients | K8s Probes | Custom Validators

Watch a Real Incident Get Resolved

From detection to resolution in 87 seconds. Every step explained.

14:35:42 (+00:00)

Incident Detected

  • Host: prod-db-03.aws.us-east-1
  • Metric: disk_usage = 91.4%
  • Baseline: 68.2%
  • Severity: Critical
14:35:45 (+00:03)

Playbook Selected

  • Top match: disk_cleanup_prod_db
  • Confidence: 0.94
  • Historical success: 127/129
  • Action: Approval required
14:35:51 (+00:09)

Approval Granted

  • Approver: john.chen@company.com
  • Role: Remediation Authority
  • Approval latency: 3 seconds
14:35:56 (+00:14)

Cleanup Executed

  • 1,247 temp files moved
  • 8.3 GB freed from /tmp
  • Rollback window: 24 hours
14:37:09 (+01:27)

Resolved + Logged

  • Final disk usage: 72.1%
  • Total duration: 87 seconds
  • Immutable audit log created

Outcome Summary

Detection to resolution: 87 seconds. Human involvement: 3 seconds. Manual work saved: ~45 minutes.

Resolution Audit Record
{
  "incident_id": "inc_2026_02_10_1435",
  "duration_seconds": 87,
  "status": "resolved",
  "playbook": "disk_cleanup_prod_db",
  "approver": "john.chen@company.com",
  "space_freed": "8.3 GB",
  "verification": "passed"
}
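
One common way to make a log tamper-evident is a hash chain, where each entry stores the hash of the previous one. The sketch below illustrates that general technique only; the page does not say how SentienGuard implements immutability, and `append_record` and `verify_chain` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def _digest(prev_hash, record):
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def append_record(log, record):
    """Append `record`, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry["hash"]

def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_record(log, {"incident_id": "inc_2026_02_10_1435", "status": "detected"})
append_record(log, {"incident_id": "inc_2026_02_10_1435", "status": "resolved"})
print(verify_chain(log))
```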

Safety First. Autonomy Second.

Multiple layers of protection prevent runaway automation.

Observation to Autonomous Modes

Start in observation mode, move to approval mode, and enable autonomous mode only after repeated successful runs.

mode: approval
approval_gate:
  required_for: [production, staging]
  timeout: 5m

RBAC Enforcement

Observer, Remediation Authority, and Admin roles enforce least-privilege approvals and overrides.

Dry Run Mode

Preview exact commands and predicted results before enabling execution.

Manual Override

Abort active execution and escalate control to human operators at any point.

Automatic Rollback

Failed verification can trigger a snapshot restore, a service restart, and post-rollback checks.

rollback:
  on_verification_failure: true
  notify: "#ops-critical"

Timeout Protection

Every step includes explicit timeout bounds to avoid hung execution paths.

steps:
  - name: clear_temp_files
    timeout: 30s

System Architecture

How the collection, decision, execution, and human layers work together.

Layer 3: Human Interaction

Web Dashboard: command center, incident timeline, playbook management
Slack Integration: approval requests, notifications, RBAC decisions
API Access: REST APIs, webhooks, export endpoints

Layer 2: SentienGuard Platform

Health Monitor and anomaly scoring
AI Decision Engine and confidence scoring
Execution Engine (SSH, K8s API, Cloud SDKs)
Immutable audit log storage + search index

Layer 1: Data Collection

Agents: eBPF + OpenTelemetry + secure command channels
Cloud APIs: AWS CloudWatch, GCP Monitoring, Azure Monitor
Kubernetes API: pod status, node health, resource usage

Data flow: Infrastructure → Detection → Decision → Approval/Execution → Verification → Audit log.

Built for Speed and Reliability

<5s

Detection Latency

eBPF and local collection are near real-time. Cloud API polling is typically the slowest source.

<3s

Playbook Selection

Vector search, context matching, and confidence scoring complete quickly for common incidents.

Variable

Execution Speed

Simple fixes complete in seconds; complex infrastructure changes can require minutes.

<1% CPU

System Overhead

Agent footprint is low: roughly 50MB RAM and ~0.3% CPU on average.

How This Differs From What You're Using Today

Aspect | Traditional Monitoring | Runbook Automation | SentienGuard | Human SRE
Detection | Real-time metrics | Manual trigger | Real-time + anomaly detection | Reactive (paged)
Decision | Human decides fix | Static if/then rules | AI-assisted playbook selection | Human judgment
Execution | Human runs commands | Automated scripts | Automated with RBAC gates | Manual execution
Verification | Manual checks | Basic checks | Automated verification + rollback | Manual verification
Audit Trail | Manual notes | Basic logs | Immutable timestamped record | Inconsistent docs
MTTR | 4+ hours | 30-60 minutes | <90 seconds (common incidents) | 45+ minutes

Technical Questions

What if AI selects the wrong playbook?

AI selects among your approved playbooks. Low-confidence incidents escalate to humans instead of guessing.

How do baselines avoid false positives?

Baselines use rolling historical windows and time-of-day context, then trigger on significant deviation.

Can playbooks call external APIs?

Yes. SSH, Kubernetes APIs, cloud SDKs, webhooks, Terraform, and scripts are supported.

How do you prevent infinite loops?

Cooldowns, repeated-trigger limits, and automatic escalation prevent self-trigger spirals.
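
A cooldown combined with a repeated-trigger limit could look like the sketch below. The class `TriggerGuard`, its default values, and the three outcomes ("run", "skip", "escalate") are assumptions chosen for illustration, not documented product behavior.

```python
import time
from collections import defaultdict, deque

class TriggerGuard:
    """Decide whether a playbook may run again for the same host and metric."""

    def __init__(self, cooldown_s=300, max_triggers=3, window_s=3600,
                 clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.max_triggers = max_triggers
        self.window_s = window_s
        self.clock = clock
        self._runs = defaultdict(deque)  # key -> timestamps of recent runs

    def decide(self, key):
        now = self.clock()
        runs = self._runs[key]
        while runs and now - runs[0] > self.window_s:
            runs.popleft()               # forget runs outside the window
        if len(runs) >= self.max_triggers:
            return "escalate"            # the fix keeps recurring: page a human
        if runs and now - runs[-1] < self.cooldown_s:
            return "skip"                # still cooling down from the last run
        runs.append(now)
        return "run"

guard = TriggerGuard()
print(guard.decide(("prod-db-03", "disk_usage")))
```

Escalating once the limit is reached, instead of silently skipping, surfaces incidents where the playbook treats a symptom and the underlying cause keeps coming back.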

How are secrets handled?

Secrets are referenced by name and fetched at runtime from your secret manager; values are not logged.
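
A sketch of name-based secret references: the command template carries only a secret's name, the value is substituted just before execution, and the logged form never contains the value. The `{{secret:name}}` placeholder syntax and the functions `render` and `loggable` are hypothetical; `fetch_secret` stands in for a call to your secret manager.

```python
import re

# Hypothetical placeholder syntax, e.g. {{secret:db/password}}
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_\-/]+)\}\}")

def render(template, fetch_secret):
    """Substitute secret values at runtime, just before execution."""
    return PLACEHOLDER.sub(lambda m: fetch_secret(m.group(1)), template)

def loggable(template):
    """Form written to logs: secret names are kept, values never appear."""
    return PLACEHOLDER.sub(lambda m: f"<secret:{m.group(1)}>", template)

# In-memory stand-in for a secret manager, for demonstration only.
store = {"db/password": "s3cr3t"}
template = "psql --password={{secret:db/password}}"
command = render(template, store.__getitem__)
print(loggable(template))
```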

Try the Self-Healing Engine

1. Deploy Agents

Install on dev/staging nodes in observation mode.

Install Agent
curl -sSL get.sentienguard.com/agent | sh

Time: 2 minutes

2. Import Playbooks

Choose safe starter playbooks from the library and set approval mode.

Time: 5 minutes

3. Trigger First Incident

Run a controlled trigger, approve in Slack, and watch end-to-end verification.

Time: 2 minutes

Total time to first autonomous resolution: 8-10 minutes

Understand How It Works.
Then Watch It Work.

Download the architecture whitepaper for technical details, or see the Self-Healing Engine live in an 8-minute sandbox.

Built by ex-AWS/Google engineers. Production-ready from day one.