```
# Brittle rule example
if metric == "disk_usage" AND value > 85%:
    execute playbook "disk_cleanup"
```

RAG Intelligence
Semantic Playbook Matching Beats Brittle If/Then Rules
1536-dimension vector embeddings match incidents to remediation strategies based on context, not keyword matching. Confidence scoring determines autonomous execution vs human approval. <165ms selection latency from detection to playbook dispatch.
Why If/Then Logic Breaks at Scale
Traditional monitoring uses rules engines. They optimize for speed (simple if/then checks) but sacrifice accuracy. RAG inverts this: slightly slower selection (165ms vs 10ms) but dramatically higher accuracy (94% vs 60% correct playbook).
3:00 AM - Disk usage 86% on log-aggregator-03
3:00 AM - Rule matches: disk_usage > 85%
3:00 AM - Executes: disk_cleanup_temp_files playbook
3:02 AM - Playbook deletes /tmp (empty, no space freed)
3:02 AM - Disk still 86% (real cause: log rotation failed)
3:02 AM - Alert re-fires
3:02 AM - Rule matches again, executes same playbook
3:04 AM - Infinite loop until human intervenes

Context-blind matching
Rule: If disk >85%, clean temp files
Reality: Production database vs dev server vs log aggregator
Problem: Same threshold, different root causes, wrong fix
Keyword dependency
Rule: Matches "disk_usage" exactly
Reality: Misses "filesystem_full", "storage_capacity", "volume_usage"
Problem: Synonyms break matching
Maintenance nightmare
Rule: 500 servers × 20 metrics = 10,000 potential rules
Reality: Every new service = new rules. Every threshold change = rule update.
Problem: Rules multiply faster than humans can maintain
No learning
Rule: Executes the same playbook forever
Reality: Never learns: "This playbook failed 5 times on this host type"
Problem: Repeats failures, no improvement
Binary decisions
Rule: Match or no match (0% or 100% confidence)
Reality: Incidents have nuance
Problem: Can't express "probably this playbook, but verify first"
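The keyword-dependency and binary-decision failures above fit in a two-line sketch: the rule fires only on an exact metric name, and it returns nothing between "match" and "no match".

```python
def rule_match(metric: str, value: float) -> bool:
    """Brittle rule: exact keyword plus fixed threshold, binary outcome."""
    return metric == "disk_usage" and value > 85.0

print(rule_match("disk_usage", 91.4))       # True
print(rule_match("filesystem_full", 91.4))  # False: a synonym breaks the match
```

There is no way to express "probably this playbook, but verify first" with a boolean.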
Retrieval-Augmented Generation Pipeline
Four stages from raw incident data to playbook selection. Worst-case pipeline latency: 50ms + 100ms + 15ms + 10ms = 175ms; typical selections finish under the 165ms target.
Incident Embedding
Latency: <50ms
- Incident data converted to a natural language description
- Passed to OpenAI embedding model (text-embedding-3-large)
- Output: 1536-dimension vector representing incident semantics

```json
{
  "host": "prod-db-03.us-east-1",
  "metric": "disk_usage",
  "value": 91.4,
  "baseline": 68.2,
  "deviation": 4.8,
  "environment": "production",
  "service": "postgresql",
  "time": "2026-02-10T14:35:42Z"
}
```

"Production PostgreSQL database server prod-db-03 in us-east-1 experiencing disk usage anomaly: 91.4% observed, 68.2% expected, 4.8 standard deviations above baseline at 2:35 PM on Tuesday."

```
[0.023, -0.891, 0.445, ..., 0.129]  // 1536 numbers
```

Semantic Search
Latency: <100ms
- Incident vector compared to all playbook vectors in the library
- Cosine similarity calculated (measures the angle between vectors)
- Top 5 most similar playbooks retrieved
- Library: 50+ pre-built + unlimited custom playbooks

```
similarity = cosine(incident_vector, playbook_vector)
           = dot_product(A, B) / (magnitude(A) * magnitude(B))
           = 0.94  // Higher = more similar
```

1. disk_cleanup_prod_db (similarity: 0.94)
2. disk_cleanup_general (similarity: 0.87)
3. log_rotation_postgres (similarity: 0.82)
4. database_vacuum (similarity: 0.76)
5. filesystem_expansion (similarity: 0.71)

Context Filtering
Latency: <15ms
- Top 5 candidates filtered by metadata constraints
- Host type, environment, time-of-day, historical success
- Incompatible playbooks removed before scoring

```yaml
# Playbook metadata
playbook: disk_cleanup_prod_db
constraints:
  host_pattern: "*.db.*"        # Must match database servers
  service: "postgresql"         # Must be PostgreSQL
  environment: ["production"]   # Production only
```

Filtering Results
✓ disk_cleanup_prod_db: host=prod-db-03, service=postgresql, env=production
✓ disk_cleanup_general: no constraints (universal playbook)
✓ log_rotation_postgres: matches postgresql service
✗ database_vacuum: removed (constraint: only run during maintenance windows)
✗ filesystem_expansion: removed (constraint: requires approval, cloud provider API)
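The constraint filter can be sketched with Python's `fnmatch` for glob-style host patterns. Note this is an illustrative sketch: the platform's actual pattern syntax (e.g. `*.db.*`) may differ from `fnmatch` semantics, so the example uses a hypothetical `prod-db-*` pattern.

```python
from fnmatch import fnmatch

def passes_constraints(incident: dict, meta: dict) -> bool:
    """Drop candidates whose metadata constraints the incident violates.
    Sketch only: the real engine also checks time-of-day windows and
    approval requirements before scoring."""
    if meta.get("service") and incident["service"] != meta["service"]:
        return False
    if meta.get("environment") and incident["environment"] not in meta["environment"]:
        return False
    if meta.get("host_pattern") and not fnmatch(incident["host"], meta["host_pattern"]):
        return False
    return True

incident = {"host": "prod-db-03.us-east-1", "service": "postgresql",
            "environment": "production"}
print(passes_constraints(incident, {"host_pattern": "prod-db-*",
                                    "service": "postgresql",
                                    "environment": ["production"]}))  # True
print(passes_constraints(incident, {"service": "redis"}))             # False
```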
Historical Success Rates
disk_cleanup_prod_db
46/47 runs successful · avg 87s
disk_cleanup_general
178/203 runs successful · avg 62s
log_rotation_postgres
35/38 runs successful · avg 45s
Confidence Scoring
Latency: <10ms
- Final playbook selected based on weighted score
- Confidence determines: autonomous, approval-required, or escalate

```
confidence = (0.6 × semantic_similarity) +
             (0.3 × historical_success_rate) +
             (0.1 × recency_boost)

disk_cleanup_prod_db:
  = (0.6 × 0.94) + (0.3 × 0.979) + (0.1 × 1.0)
  = 0.564 + 0.294 + 0.100
  = 0.958  // 95.8% confidence
```

Confidence Thresholds
- >0.90: Execute autonomously (no approval needed)
- 0.70–0.90: Approval required (Slack notification, human confirms)
- <0.70: Escalate to human (no playbook match confident enough)

Final Selection
disk_cleanup_prod_db
Confidence: 0.958 (95.8%)
Action: Execute autonomously (>0.90 threshold)
Estimated duration: 87 seconds (historical average)
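The weighted formula and the 0.90/0.70 thresholds above translate directly into code. The exact boundary handling (inclusive vs exclusive) is an assumption.

```python
def confidence(similarity: float, success_rate: float, recency: float) -> float:
    """Weighted confidence score from the formula above."""
    return 0.6 * similarity + 0.3 * success_rate + 0.1 * recency

def route(score: float) -> str:
    """Map a confidence score to an execution mode (thresholds from this page)."""
    if score > 0.90:
        return "autonomous"
    if score >= 0.70:
        return "approval_required"
    return "escalate"

score = confidence(0.94, 0.979, 1.0)
print(round(score, 3), route(score))  # 0.958 autonomous
```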
Real Incident → Playbook Selection
Walk through a real incident matching flow: from raw metrics to autonomous remediation with full audit trail.
Incident Details
```json
{
  "incident_id": "inc_2026_02_10_1435",
  "timestamp": "2026-02-10T14:35:42.124Z",
  "host": "prod-db-03.us-east-1",
  "environment": "production",
  "service": "postgresql",
  "metric": "disk_usage",
  "current_value": 91.4,
  "baseline": 68.2,
  "deviation": 4.8,
  "severity": "critical"
}
```

Natural Language
"Production PostgreSQL database prod-db-03 in us-east-1 experiencing critical disk usage: 91.4% current vs 68.2% baseline, 4.8 standard deviations above normal."
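The JSON-to-prose conversion can be sketched as a template over the incident fields above. The wording is illustrative; the real converter may phrase things differently.

```python
def describe_incident(i: dict) -> str:
    """Render a structured incident record as prose suitable for embedding.
    Field names follow the incident JSON above."""
    return (
        f"{i['environment'].capitalize()} {i['service']} server {i['host']} "
        f"experiencing {i['severity']} {i['metric'].replace('_', ' ')}: "
        f"{i['current_value']}% current vs {i['baseline']}% baseline, "
        f"{i['deviation']} standard deviations above normal."
    )

incident = {
    "host": "prod-db-03.us-east-1", "environment": "production",
    "service": "postgresql", "metric": "disk_usage",
    "current_value": 91.4, "baseline": 68.2,
    "deviation": 4.8, "severity": "critical",
}
print(describe_incident(incident))
```

The resulting string is what gets sent to the embedding model; richer prose (service, environment, deviation) is what lets cosine similarity capture context that keyword rules miss.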
Top 3 Playbook Candidates

1. disk_cleanup_prod_db
- Semantic similarity: 0.94
- Historical success: 97.9% (46/47 runs)
- Constraints: ✓ host=*.db.*, service=postgresql, env=production
- Last run: 24 hours ago (successful)
- Avg duration: 87 seconds
- Final confidence: 0.958

2. disk_cleanup_general
- Semantic similarity: 0.87
- Historical success: 87.7% (178/203 runs)
- Constraints: ✓ No constraints (universal)
- Last run: 5 hours ago (successful)
- Avg duration: 62 seconds
- Final confidence: 0.854

3. log_rotation_postgres
- Semantic similarity: 0.82
- Historical success: 92.1% (35/38 runs)
- Constraints: ✓ service=postgresql
- Last run: 3 days ago (successful)
- Avg duration: 45 seconds
- Final confidence: 0.821
```yaml
name: disk_cleanup_prod_db
version: 1.4.2
steps:
  - name: clear_temp_files
    command: "find /tmp -type f -mtime +7 -delete"
  - name: rotate_logs
    command: "logrotate -f /etc/logrotate.conf"
  - name: verify_space_freed
    health_check: "disk_usage < 80%"
```

Selection Decision
Selected: disk_cleanup_prod_db
Reason: Highest confidence (0.958 > 0.90 threshold)
Action: Execute autonomously
Notification: Informational Slack message (not approval request)
Outcome
Execution time: 87 seconds
Disk usage: 91.4% → 72.1%
Health verification: PASS
Status: Resolved autonomously
Confidence Improves Over Time
New playbooks start conservative (human oversight) and earn autonomy through proven success. This prevents “AI running wild” while allowing automation to scale as confidence builds.
Confidence Score Over Time
First Deployment
Escalate to human
Total runs: 0
Success: N/A
Confidence: 68.4%
(0.6 × 0.89) + (0.3 × 0.50 assumed) + (0.1 × 0.0)
Result: Human reviews, approves manually, playbook succeeds
Building Confidence
Approval required
Total runs: 3
Success: 100% (3/3)
Confidence: 88.4%
(0.6 × 0.89) + (0.3 × 1.00) + (0.1 × 0.5)
Result: Slack notification, human approves, playbook succeeds
Approaching Autonomy
Still approval-required
Total runs: 12
Success: 91.7% (11/12)
Confidence: 88.9%
(0.6 × 0.89) + (0.3 × 0.917) + (0.1 × 0.8)
Result: Human approves 12 times, all successes
Autonomous
Execute autonomously
Total runs: 47
Success: 97.9% (46/47)
Confidence: 92.8%
(0.6 × 0.89) + (0.3 × 0.979) + (0.1 × 1.0)
Result: No human approval needed, runs automatically
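The four stages above all come from the same weighted formula, holding semantic similarity at 0.89 and plugging in each stage's success rate and recency boost (the 0.50 assumed prior and the recency values are taken from the worked numbers above):

```python
def confidence(similarity: float, success_rate: float, recency_boost: float) -> float:
    """Weighted confidence score used throughout this page."""
    return 0.6 * similarity + 0.3 * success_rate + 0.1 * recency_boost

stages = [
    ("first deployment",     0.50,    0.0),  # assumed 50% prior, no recency
    ("building confidence",  1.00,    0.5),  # 3/3 successes
    ("approaching autonomy", 11 / 12, 0.8),  # 11/12 successes
    ("autonomous",           46 / 47, 1.0),  # 46/47 successes
]
for name, rate, recency in stages:
    print(name, round(confidence(0.89, rate, recency), 3))
# Reproduces 0.684, 0.884, 0.889, 0.928 from the stages above.
```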
How Playbooks Are Stored and Matched
Every playbook is a YAML file with execution steps, metadata for RAG matching, and historical performance stats that update after each run.
```yaml
# Playbook YAML file
name: disk_cleanup_prod_db
version: 1.4.2
description: |
  Clear disk space on production database servers by removing
  temporary files older than 7 days and rotating logs. Targets
  PostgreSQL servers experiencing disk usage >85%.

# Metadata for RAG matching
metadata:
  tags: ["disk", "cleanup", "database", "postgresql", "storage"]
  host_pattern: "*.db.*"
  service: "postgresql"
  environment: ["production"]
  severity: ["warning", "critical"]

# Vector embedding (computed at playbook creation)
embedding: [0.023, -0.891, 0.445, ..., 0.129]  # 1536 dimensions

# Historical performance (updated after each run)
performance:
  total_runs: 47
  successful: 46
  failed: 1
  success_rate: 0.979
  avg_duration_seconds: 87
  last_run: "2026-02-09T03:12:45Z"
  last_result: "success"

# Execution steps
steps:
  - name: clear_temp_files
    action: ssh_command
    command: "find /tmp -type f -mtime +7 -delete"
    timeout: 60s
  - name: rotate_logs
    action: ssh_command
    command: "logrotate -f /etc/logrotate.conf"
    timeout: 60s
  - name: verify_space_freed
    action: health_check
    metric: disk_usage
    threshold: "< 80%"
    retry: 3
    retry_delay: 10s
```

Library Organization
```
Playbook Library (vector database)
├── Pre-built Playbooks (50+)
│   ├── disk_cleanup_linux
│   ├── disk_cleanup_prod_db
│   ├── memory_restart_service
│   ├── k8s_pod_restart
│   ├── postgres_connection_reset
│   └── ssl_cert_renewal
├── Custom Playbooks (unlimited)
│   ├── custom_app_restart
│   └── custom_cache_clear
└── Embeddings Index
    ├── Incident vectors → Playbook vectors
    ├── Similarity search <100ms
    └── Supports 10,000+ playbooks
```
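In miniature, the similarity search over the embeddings index is nearest-neighbour lookup by cosine similarity. This brute-force sketch uses toy 3-dimensional vectors; production systems use an HNSW index over the full 1536-dimensional embeddings to keep search under 100ms at 10,000+ playbooks.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(incident_vec, library, k=5):
    """Brute-force top-k search; a real deployment delegates this to an
    HNSW-backed vector database (Pinecone / Weaviate / Qdrant)."""
    scored = [(name, cosine(incident_vec, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Toy vectors, purely illustrative (real embeddings are 1536-dimensional).
library = {
    "disk_cleanup_prod_db": [0.9, 0.1, 0.0],
    "ssl_cert_renewal":     [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], library, k=2))
```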
Why RAG Outperforms Traditional Rules
Roughly 150ms of extra selection latency (165ms vs 10ms) to avoid executing the wrong playbook is an excellent trade.
| Dimension | Rules Engine | RAG Intelligence | Winner |
|---|---|---|---|
| Matching Method | Exact keyword match | Semantic similarity | RAG |
| Context Awareness | None (if metric=="X") | Full (host, service, time, history) | RAG |
| Synonyms | Fail (must match exactly) | Handle automatically | RAG |
| Maintenance | Manual (update rules per change) | Automatic (learns from embeddings) | RAG |
| New Playbooks | Write new rules for each | Auto-indexed, immediately searchable | RAG |
| Confidence | Binary (match or no match) | Scored (0.0–1.0 confidence) | RAG |
| Learning | Static (never improves) | Dynamic (confidence increases) | RAG |
| Selection Speed | Faster (10ms if/then) | Slower (165ms embedding + search) | Rules |
| Selection Accuracy | Lower (60–70% correct) | Higher (90–95% correct) | RAG |
| False Positives | High (wrong playbook executed) | Low (low confidence = escalate) | RAG |
| Scalability | Poor (rules multiply) | Excellent (vector search scales) | RAG |
Outcome comparison over 100 disk incidents: Rules Engine vs RAG Intelligence (accuracy and false-positive rates summarized in the table above).
RAG Pipeline Components
End-to-end architecture from incident detection to playbook dispatch. Every component optimized for production latency and reliability.
Incident Detection
- Receives structured incident data from anomaly detection engine
- Includes host, metric, value, baseline, deviation, environment, service
Natural Language Conversion
- Converts structured JSON to human-readable incident description
- Captures full semantic context for accurate embedding
OpenAI Embedding Model (text-embedding-3-large)
Latency: <50ms
- Input: Natural language text
- Output: 1536-dimension vector
- Cost: ~$0.0001 per incident
- Alternative: Self-hosted (sentence-transformers) for air-gapped environments

Vector Database (Pinecone / Weaviate / Qdrant)
Latency: <100ms
- Index type: HNSW (Hierarchical Navigable Small World)
- Distance metric: Cosine similarity
- Returns: Top 5 most similar playbooks
- Persistence: Disk-backed (survives restarts)

Context Filtering Engine
Latency: <15ms
- Host pattern matching, environment restrictions
- Historical success rate comparison
- Time-of-day constraints
- Caching: Recent incidents cached for 5 minutes

Confidence Scoring
Latency: <10ms
- Formula: 0.6×similarity + 0.3×success_rate + 0.1×recency
- Thresholds configurable per organization
- Override: Admins can force autonomous/approval per playbook

Selected Playbook Dispatched
Total latency: <175ms
Playbook name, confidence score, execution mode, and estimated duration sent to the execution orchestrator.
Beyond Basic Matching
Multi-Metric Correlation
Problem: Single metric anomaly might not warrant playbook execution.
Solution: RAG considers multiple related metrics. If disk_usage is 91% but inode_usage, write_iops, and read_latency are all normal, confidence drops and approval is required instead of autonomous execution.
Disk high but performance normal → likely expected behavior → lower confidence (0.82), require approval
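One way to sketch that dampening is to scale the base confidence by how many correlated metrics corroborate the anomaly. The 0.85 floor and linear scaling here are illustrative assumptions, not the product's published formula; they are chosen so an uncorroborated incident drops below the 0.90 autonomous threshold.

```python
def dampened_confidence(base: float, anomalous_related: int, total_related: int) -> float:
    """Scale confidence by the fraction of related metrics that are also
    anomalous (illustrative sketch; factor values are assumptions)."""
    corroboration = anomalous_related / total_related
    return base * (0.85 + 0.15 * corroboration)

# disk_usage at 91% but inode_usage, write_iops, read_latency all normal:
print(round(dampened_confidence(0.958, 0, 3), 3))  # 0.814: below 0.90, so approval required
```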
Time-of-Day Awareness
Problem: Same metric at different times can mean different root causes.
Solution: Embedding includes time context. CPU 85% at 2 PM matches traffic_surge_scaling (scale horizontally). CPU 85% at 3 AM matches background_job_throttle (slow down batch processing).
Same metric, different playbooks based on time of day
Incident Clustering
Problem: 5 servers with same issue = 1 root cause, not 5 separate incidents.
Solution: RAG clusters similar incidents. When prod-db-01, 02, and 03 all show disk 90%+ within 2 minutes, it identifies the shared root cause (e.g., shared NFS mount full) and executes one cluster-wide playbook.
3 servers alerted → cluster detected → 1 playbook execution, not 3
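A simplified clustering sketch: group same-metric incidents that fire within a two-minute window of each other. The real clusterer presumably also compares embedding similarity and host topology before declaring a shared root cause.

```python
from datetime import datetime, timedelta

def cluster_incidents(incidents, window=timedelta(minutes=2)):
    """Group same-metric incidents arriving within `window` of the previous
    one (simplified sketch of incident clustering)."""
    clusters = []
    for inc in sorted(incidents, key=lambda i: i["time"]):
        last = clusters[-1][-1] if clusters else None
        if last and inc["metric"] == last["metric"] and inc["time"] - last["time"] <= window:
            clusters[-1].append(inc)
        else:
            clusters.append([inc])
    return clusters

t0 = datetime(2026, 2, 10, 14, 35)
alerts = [
    {"host": "prod-db-01", "metric": "disk_usage", "time": t0},
    {"host": "prod-db-02", "metric": "disk_usage", "time": t0 + timedelta(seconds=40)},
    {"host": "prod-db-03", "metric": "disk_usage", "time": t0 + timedelta(seconds=95)},
]
print(len(cluster_incidents(alerts)))  # 1 cluster -> one playbook execution, not three
```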
Negative Matching
Problem: Some playbooks should NEVER run on certain hosts or during certain windows.
Solution: Exclusion rules filter out dangerous playbooks before scoring. aggressive_cache_clear never runs on production, never during backup windows (00:00–06:00), and never during major outages (>5 concurrent incidents).
Excluded playbooks removed before confidence scoring
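The exclusion check runs before confidence scoring. The rule shapes below (environment blocklist, blackout hours, concurrent-incident cap) are illustrative assumptions derived from the description above, not the platform's actual schema.

```python
def is_excluded(playbook: str, env: str, hour: int, concurrent_incidents: int,
                exclusions: dict) -> bool:
    """Return True if an exclusion rule bars this playbook from running
    (sketch; rule shapes are assumptions)."""
    rules = exclusions.get(playbook, {})
    if env in rules.get("never_environments", []):
        return True
    start, end = rules.get("blackout_hours", (None, None))
    if start is not None and start <= hour < end:
        return True
    if concurrent_incidents > rules.get("max_concurrent_incidents", float("inf")):
        return True
    return False

exclusions = {"aggressive_cache_clear": {
    "never_environments": ["production"],
    "blackout_hours": (0, 6),        # backup window 00:00-06:00
    "max_concurrent_incidents": 5,   # skip during major outages
}}
print(is_excluded("aggressive_cache_clear", "production", 14, 1, exclusions))  # True
print(is_excluded("aggressive_cache_clear", "staging", 14, 1, exclusions))     # False
```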
Common Questions
Q: What happens when no playbook matches an incident?
The confidence score falls below the 0.70 threshold. SentienGuard escalates to a human via Slack, email, or PagerDuty. You investigate manually, then create a new playbook for future occurrences. After 3–5 successful manual resolutions, RAG has enough confidence to start running it autonomously.

Q: Can I trigger a playbook manually?
Yes. Admins can manually trigger any playbook from the dashboard. This is useful for testing new playbooks or handling edge cases. Manual triggers are still logged in the immutable audit trail.

Q: How do I write playbooks that match well?
Write a detailed description in the YAML metadata. RAG embeds the description, so natural language clarity matters more than keywords. Good: "Clear disk space on PostgreSQL production databases by removing temporary files older than 7 days and rotating application logs. Use when disk usage exceeds 85% and database performance is unaffected." Bad: "Cleans disk".

Q: How is RAG different from traditional ML?
Traditional ML detects anomalies (what's broken). RAG selects remediation (how to fix it). Different problems, complementary solutions. SentienGuard uses both: statistical ML for anomaly detection, RAG for playbook selection.

Q: Can RAG invent new remediation steps?
No. RAG only selects from your existing playbook library. It cannot invent new remediation steps. This is "Retrieval-Augmented" Generation: retrieval constrains output to real playbooks. If no playbook matches (confidence <0.70), RAG escalates rather than guessing.

Q: What does the AI cost?
Approximately $0.0001 per incident (negligible). For 10,000 incidents/month that's about $1/month in embedding cost. The platform cost ($4/node) includes AI service fees, so no surprise bills.
Try RAG Intelligence
1. Deploy SentienGuard agents on 3 nodes (free tier).
2. Import the pre-built playbook library (50+ playbooks, already embedded).
3. Trigger a test incident (fill a disk to 90%).
4. Watch RAG select a playbook in <165ms.
5. Review the confidence score and execution results.
Free tier: 3 nodes, unlimited playbooks, full audit logs, no credit card required.