SentienGuard

RAG Intelligence

Semantic Playbook Matching Beats Brittle If/Then Rules

1536-dimension vector embeddings match incidents to remediation strategies based on context, not keyword matching. Confidence scoring determines autonomous execution vs human approval. <165ms selection latency from detection to playbook dispatch.

165ms · Playbook selection latency (RAG semantic search)
0.94 · Average confidence score (high-certainty matches)
1536 · Embedding dimensions (OpenAI text-embedding-3-large)

Why If/Then Logic Breaks at Scale

Traditional monitoring uses rules engines. They optimize for speed (simple if/then checks) but sacrifice accuracy. RAG inverts this: slightly slower selection (165ms vs 10ms) but dramatically higher accuracy (94% vs 60% correct playbook).

Brittle Rule Example
# Brittle rule example
if metric == "disk_usage" and value > 85:
    execute_playbook("disk_cleanup")
Example Failure Cascade
3:00 AM - Disk usage 86% on log-aggregator-03
3:00 AM - Rule matches: disk_usage > 85%
3:00 AM - Executes: disk_cleanup_temp_files playbook
3:02 AM - Playbook deletes /tmp (empty, no space freed)
3:02 AM - Disk still 86% (real cause: log rotation failed)
3:02 AM - Alert re-fires
3:02 AM - Rule matches again, executes same playbook
3:04 AM - Infinite loop until human intervenes
1. Context-blind matching

Rule: If disk >85%, clean temp files

Reality: Production database vs dev server vs log aggregator

Problem: Same threshold, different root causes, wrong fix

2. Keyword dependency

Rule: Matches "disk_usage" exactly

Reality: Misses "filesystem_full", "storage_capacity", "volume_usage"

Problem: Synonyms break matching

3. Maintenance nightmare

Rule: 500 servers × 20 metrics = 10,000 potential rules

Reality: Every new service means new rules; every threshold change means a rule update

Problem: Rules multiply faster than humans can maintain

4. No learning

Rule: Executes the same playbook forever

Reality: Never learns that a playbook failed 5 times on this host type

Problem: Repeats failures, no improvement

5. Binary decisions

Rule: Match or no match (0% or 100% confidence)

Reality: Incidents have nuance

Problem: Can't express "probably this playbook, but verify first"

Retrieval-Augmented Generation Pipeline

Four stages from raw incident data to playbook selection. Worst-case stage budget: 50ms + 100ms + 15ms + 10ms = 175ms; typical end-to-end selection completes in under 165ms.

1. Incident Embedding (<50ms)
  • Incident data converted to natural language description
  • Passed to OpenAI embedding model (text-embedding-3-large)
  • Output: 1536-dimension vector representing incident semantics
Input: Incident Data
{
  "host": "prod-db-03.us-east-1",
  "metric": "disk_usage",
  "value": 91.4,
  "baseline": 68.2,
  "deviation": 4.8,
  "environment": "production",
  "service": "postgresql",
  "time": "2026-02-10T14:35:42Z"
}
Natural Language Conversion
"Production PostgreSQL database server prod-db-03 in us-east-1
experiencing disk usage anomaly: 91.4% observed, 68.2% expected,
4.8 standard deviations above baseline at 2:35 PM on Tuesday."
Embedding Output (1536-dim)
[0.023, -0.891, 0.445, ..., 0.129]  // 1536 numbers
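The structured-to-text step above can be sketched in a few lines of Python. The `incident_to_text` helper is hypothetical (not SentienGuard's actual code); its field names mirror the incident JSON shown:

```python
def incident_to_text(incident: dict) -> str:
    """Render structured incident data as a natural-language description
    for embedding. Hypothetical helper; field names mirror the JSON above."""
    return (
        f"{incident['environment'].capitalize()} {incident['service']} server "
        f"{incident['host']} experiencing {incident['metric']} anomaly: "
        f"{incident['value']}% observed, {incident['baseline']}% expected, "
        f"{incident['deviation']} standard deviations above baseline."
    )

incident = {
    "host": "prod-db-03.us-east-1",
    "metric": "disk_usage",
    "value": 91.4,
    "baseline": 68.2,
    "deviation": 4.8,
    "environment": "production",
    "service": "postgresql",
}
description = incident_to_text(incident)
```

The resulting sentence, not the raw JSON, is what gets embedded, so the description carries environment, service, and severity context into the vector.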
2. Semantic Search (<100ms)
  • Incident vector compared to all playbook vectors in library
  • Cosine similarity calculated (measures angle between vectors)
  • Top 5 most similar playbooks retrieved
  • Library: 50+ pre-built + unlimited custom playbooks
Similarity Calculation
similarity = cosine(incident_vector, playbook_vector)
           = dot_product(A, B) / (magnitude(A) * magnitude(B))
           = 0.94  // Higher = more similar
Top 5 Results
1. disk_cleanup_prod_db        (similarity: 0.94)
2. disk_cleanup_general        (similarity: 0.87)
3. log_rotation_postgres       (similarity: 0.82)
4. database_vacuum             (similarity: 0.76)
5. filesystem_expansion        (similarity: 0.71)
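The similarity calculation above is plain cosine similarity. A dependency-free sketch with toy 4-dimension vectors (production uses the 1536-dimension embeddings; the vectors and playbook names here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

# Toy 4-dim vectors standing in for 1536-dim embeddings
incident_vec = [0.1, 0.8, 0.3, 0.2]
playbooks = {
    "disk_cleanup_prod_db": [0.1, 0.9, 0.3, 0.1],
    "ssl_cert_renewal": [0.9, 0.0, 0.1, 0.4],
}
# Rank playbooks by similarity to the incident, highest first
ranked = sorted(playbooks, key=lambda name: cosine(incident_vec, playbooks[name]), reverse=True)
```

In production this ranking is done by the vector database's index rather than a linear scan, but the distance metric is the same.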
3. Context Filtering (<15ms)
  • Top 5 candidates filtered by metadata constraints
  • Host type, environment, time-of-day, historical success
  • Incompatible playbooks removed before scoring
Filter Constraints
# Playbook metadata
playbook: disk_cleanup_prod_db
constraints:
  host_pattern: "*.db.*"        # Must match database servers
  service: "postgresql"         # Must be PostgreSQL
  environment: ["production"]   # Production only

Filtering Results

disk_cleanup_prod_db

host=prod-db-03, service=postgresql, env=production

disk_cleanup_general

No constraints (universal playbook)

log_rotation_postgres

Matches postgresql service

database_vacuum

Constraint: only run during maintenance windows

filesystem_expansion

Constraint: requires approval, cloud provider API

Historical Success Rates

disk_cleanup_prod_db

46/47 runs successful · avg 87s

97.9%

disk_cleanup_general

178/203 runs successful · avg 62s

87.7%

log_rotation_postgres

35/38 runs successful · avg 45s

92.1%
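The metadata filtering above can be sketched with Python's `fnmatch` glob matching. This is a hypothetical filter: the constraint keys mirror the YAML shown, but the `prod-db-*` pattern is an assumption, since the product's own host-pattern syntax may differ:

```python
from fnmatch import fnmatch

def passes_constraints(incident: dict, constraints: dict) -> bool:
    """Drop a candidate playbook whose metadata constraints don't fit
    the incident. Hypothetical filter; keys mirror the YAML above."""
    pattern = constraints.get("host_pattern")
    if pattern and not fnmatch(incident["host"], pattern):
        return False
    service = constraints.get("service")
    if service and incident["service"] != service:
        return False
    envs = constraints.get("environment")
    if envs and incident["environment"] not in envs:
        return False
    return True

incident = {"host": "prod-db-03.us-east-1", "service": "postgresql", "environment": "production"}
# Glob-style host pattern is an assumption for this sketch
db_cleanup = {"host_pattern": "prod-db-*", "service": "postgresql", "environment": ["production"]}
staging_only = {"service": "postgresql", "environment": ["staging"]}
```

A playbook with no constraints (like disk_cleanup_general) passes every incident, matching the "universal playbook" behavior described above.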
4. Confidence Scoring (<10ms)
  • Final playbook selected based on weighted score
  • Confidence determines: autonomous, approval-required, or escalate
Scoring Formula
confidence = (0.6 × semantic_similarity) +
             (0.3 × historical_success_rate) +
             (0.1 × recency_boost)

disk_cleanup_prod_db:
  = (0.6 × 0.94) + (0.3 × 0.979) + (0.1 × 1.0)
  = 0.564 + 0.294 + 0.100
  = 0.958  // 95.8% confidence

Confidence Thresholds

> 0.90: Autonomous execution (no approval needed)

0.70 – 0.90: Approval required (Slack notification, human confirms)

< 0.70: Escalate to human (no playbook match confident enough)

Final Selection

disk_cleanup_prod_db

Confidence: 0.958 (95.8%)

Action: Execute autonomously (>0.90 threshold)

Estimated duration: 87 seconds (historical average)
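The weighted formula and three-tier thresholds above translate directly into code; a minimal sketch using the weights and cutoffs stated on this page:

```python
def confidence(similarity: float, success_rate: float, recency: float) -> float:
    """Weighted confidence: 0.6 semantic + 0.3 historical + 0.1 recency."""
    return 0.6 * similarity + 0.3 * success_rate + 0.1 * recency

def execution_mode(score: float) -> str:
    """Map a confidence score onto the three-tier execution policy."""
    if score > 0.90:
        return "autonomous"
    if score >= 0.70:
        return "approval_required"
    return "escalate"

# The worked example's inputs for disk_cleanup_prod_db
score = confidence(similarity=0.94, success_rate=0.979, recency=1.0)
```

With the worked numbers, `score` comes out to 0.9577, i.e. the 0.958 shown, which clears the 0.90 threshold and routes to autonomous execution.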

Real Incident → Playbook Selection

Walk through a real incident matching flow: from raw metrics to autonomous remediation with full audit trail.

Incident Details

Raw Incident Data
{
  "incident_id": "inc_2026_02_10_1435",
  "timestamp": "2026-02-10T14:35:42.124Z",
  "host": "prod-db-03.us-east-1",
  "environment": "production",
  "service": "postgresql",
  "metric": "disk_usage",
  "current_value": 91.4,
  "baseline": 68.2,
  "deviation": 4.8,
  "severity": "critical"
}

Natural Language

“Production PostgreSQL database prod-db-03 in us-east-1 experiencing critical disk usage: 91.4% current vs 68.2% baseline, 4.8 standard deviations above normal.”

Top 3 Playbook Candidates

1. disk_cleanup_prod_db (Selected)

Semantic similarity: 0.94

Historical success: 97.9% (46/47 runs)

Constraints: ✓ host=*.db.*, service=postgresql, env=production

Last run: 24 hours ago (successful)

Avg duration: 87 seconds

Final confidence: 0.958

2. disk_cleanup_general

Semantic similarity: 0.87

Historical success: 87.7% (178/203 runs)

Constraints: ✓ No constraints (universal)

Last run: 5 hours ago (successful)

Avg duration: 62 seconds

Final confidence: 0.854

3. log_rotation_postgres

Semantic similarity: 0.82

Historical success: 92.1% (35/38 runs)

Constraints: ✓ service=postgresql

Last run: 3 days ago (successful)

Avg duration: 45 seconds

Final confidence: 0.821

What Gets Executed
name: disk_cleanup_prod_db
version: 1.4.2
steps:
  - name: clear_temp_files
    command: "find /tmp -type f -mtime +7 -delete"
  - name: rotate_logs
    command: "logrotate -f /etc/logrotate.conf"
  - name: verify_space_freed
    health_check: "disk_usage < 80%"

Selection Decision

Selected: disk_cleanup_prod_db

Reason: Highest confidence (0.958 > 0.90 threshold)

Action: Execute autonomously

Notification: Informational Slack message (not approval request)

Outcome

Execution time: 87 seconds

Disk usage: 91.4% → 72.1%

Health verification: PASS

Status: Resolved autonomously

Confidence Improves Over Time

New playbooks start conservative (human oversight) and earn autonomy through proven success. This prevents “AI running wild” while allowing automation to scale as confidence builds.

Confidence Score Over Time

[Chart: confidence score over weeks 1–10, rising from ~0.68 through the Approval band (0.70–0.90) into Autonomous (>0.90); y-axis 0.6–1.0]
Week 1

First Deployment

Escalate to human

Total runs: 0

Success: N/A

Confidence: 68.4%

(0.6 × 0.89) + (0.3 × 0.50 assumed) + (0.1 × 0.0)

Result: Human reviews, approves manually, playbook succeeds

Week 2

Building Confidence

Approval required

Total runs: 3

Success: 100% (3/3)

Confidence: 88.4%

(0.6 × 0.89) + (0.3 × 1.00) + (0.1 × 0.5)

Result: Slack notification, human approves, playbook succeeds

Week 4

Approaching Autonomy

Still approval-required

Total runs: 12

Success: 91.7% (11/12)

Confidence: 88.9%

(0.6 × 0.89) + (0.3 × 0.917) + (0.1 × 0.8)

Result: Human approves 12 times, all successes

Week 8

Autonomous

Execute autonomously

Total runs: 47

Success: 97.9% (46/47)

Confidence: 92.8%

(0.6 × 0.89) + (0.3 × 0.979) + (0.1 × 1.0)

Result: No human approval needed, runs automatically
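Every score in the timeline above comes from the same weighted formula; recomputing them with the component values shown at each milestone:

```python
def confidence(similarity: float, success_rate: float, recency: float) -> float:
    """Weighted confidence: 0.6 semantic + 0.3 historical + 0.1 recency."""
    return 0.6 * similarity + 0.3 * success_rate + 0.1 * recency

# (label, semantic similarity, historical success, recency boost), from the timeline
timeline = [
    ("Week 1", 0.89, 0.50, 0.0),   # no history yet, so 0.50 assumed
    ("Week 2", 0.89, 1.00, 0.5),
    ("Week 4", 0.89, 0.917, 0.8),
    ("Week 8", 0.89, 0.979, 1.0),
]
scores = {label: round(confidence(s, sr, r), 3) for label, s, sr, r in timeline}
```

Only the historical and recency terms change week to week; the semantic similarity stays fixed at 0.89, which is why the score climbs gradually rather than jumping.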

How Playbooks Are Stored and Matched

Every playbook is a YAML file with execution steps, metadata for RAG matching, and historical performance stats that update after each run.

Playbook Anatomy
# Playbook YAML file
name: disk_cleanup_prod_db
version: 1.4.2
description: |
  Clear disk space on production database servers by removing
  temporary files older than 7 days and rotating logs. Targets
  PostgreSQL servers experiencing disk usage >85%.

# Metadata for RAG matching
metadata:
  tags: ["disk", "cleanup", "database", "postgresql", "storage"]
  host_pattern: "*.db.*"
  service: "postgresql"
  environment: ["production"]
  severity: ["warning", "critical"]

# Vector embedding (computed at playbook creation)
embedding: [0.023, -0.891, 0.445, ..., 0.129]  # 1536 dimensions

# Historical performance (updated after each run)
performance:
  total_runs: 47
  successful: 46
  failed: 1
  success_rate: 0.979
  avg_duration_seconds: 87
  last_run: "2026-02-09T03:12:45Z"
  last_result: "success"

# Execution steps
steps:
  - name: clear_temp_files
    action: ssh_command
    command: "find /tmp -type f -mtime +7 -delete"
    timeout: 60s

  - name: rotate_logs
    action: ssh_command
    command: "logrotate -f /etc/logrotate.conf"
    timeout: 60s

  - name: verify_space_freed
    action: health_check
    metric: disk_usage
    threshold: "< 80%"
    retry: 3
    retry_delay: 10s
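The performance block is updated after every execution; a sketch of that bookkeeping (the dict keys mirror the YAML above, but `record_run` itself is a hypothetical helper, not SentienGuard's code):

```python
def record_run(perf: dict, success: bool, duration_s: float, timestamp: str) -> None:
    """Update a playbook's historical stats in place after one execution."""
    perf["total_runs"] += 1
    perf["successful" if success else "failed"] += 1
    perf["success_rate"] = perf["successful"] / perf["total_runs"]
    # Incremental running average: no per-run history needed
    perf["avg_duration_seconds"] += (duration_s - perf["avg_duration_seconds"]) / perf["total_runs"]
    perf["last_run"] = timestamp
    perf["last_result"] = "success" if success else "failure"

perf = {"total_runs": 46, "successful": 45, "failed": 1,
        "success_rate": 45 / 46, "avg_duration_seconds": 87.0,
        "last_run": "", "last_result": ""}
record_run(perf, success=True, duration_s=87.0, timestamp="2026-02-09T03:12:45Z")
```

Because `success_rate` feeds the 0.3-weight term of the confidence formula, each successful run directly nudges the playbook toward autonomous execution.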

Library Organization

Playbook Library (vector database)
├── Pre-built Playbooks (50+)
│   ├── disk_cleanup_linux
│   ├── disk_cleanup_prod_db
│   ├── memory_restart_service
│   ├── k8s_pod_restart
│   ├── postgres_connection_reset
│   └── ssl_cert_renewal
├── Custom Playbooks (unlimited)
│   ├── custom_app_restart
│   └── custom_cache_clear
└── Embeddings Index
    ├── Incident vectors → Playbook vectors
    ├── Similarity search <100ms
    └── Supports 10,000+ playbooks

Search Performance

Library Size       | Search Latency | Memory Usage
50 playbooks       | <50ms          | 10 MB
500 playbooks      | <75ms          | 100 MB
1,000 playbooks    | <100ms         | 200 MB
10,000 playbooks   | <200ms         | 2 GB

Why RAG Outperforms Traditional Rules

~155ms of extra selection latency (165ms vs 10ms) to avoid executing the wrong playbook is an excellent trade.

Dimension          | Rules Engine                     | RAG Intelligence                     | Winner
Matching Method    | Exact keyword match              | Semantic similarity                  | RAG
Context Awareness  | None (if metric=="X")            | Full (host, service, time, history)  | RAG
Synonyms           | Fail (must match exactly)        | Handled automatically                | RAG
Maintenance        | Manual (update rules per change) | Automatic (learns from embeddings)   | RAG
New Playbooks      | Write new rules for each         | Auto-indexed, immediately searchable | RAG
Confidence         | Binary (match or no match)       | Scored (0.0–1.0 confidence)          | RAG
Learning           | Static (never improves)          | Dynamic (confidence increases)       | RAG
Selection Speed    | Faster (10ms if/then)            | Slower (165ms embedding + search)    | Rules
Selection Accuracy | Lower (60–70% correct)           | Higher (90–95% correct)              | RAG
False Positives    | High (wrong playbook executed)   | Low (low confidence = escalate)      | RAG
Scalability        | Poor (rules multiply)            | Excellent (vector search scales)     | RAG

Rules Engine Results (100 disk incidents)

Correct playbook: 62 incidents (62%)
Wrong playbook: 28 incidents (28%)
No match: 10 incidents (10%)
Average MTTR: 3.2 hours (includes fixing wrong playbook executions)

RAG Intelligence Results (100 disk incidents)

Correct playbook (>0.90 confidence): 87 incidents (87%)
Approval-required (0.70–0.90): 11 incidents (11%)
Escalated (<0.70): 2 incidents (2%)
Average MTTR: 92 seconds autonomous (8 minutes approved)

RAG Pipeline Components

End-to-end architecture from incident detection to playbook dispatch. Every component optimized for production latency and reliability.

Incident Detection

  • Receives structured incident data from anomaly detection engine
  • Includes host, metric, value, baseline, deviation, environment, service

Natural Language Conversion

  • Converts structured JSON to human-readable incident description
  • Captures full semantic context for accurate embedding

OpenAI Embedding Model (text-embedding-3-large)

<50ms
  • Input: Natural language text
  • Output: 1536-dimension vector
  • Cost: ~$0.0001 per incident
  • Alternative: Self-hosted (sentence-transformers) for air-gapped

Vector Database (Pinecone / Weaviate / Qdrant)

<100ms
  • Index type: HNSW (Hierarchical Navigable Small World)
  • Distance metric: Cosine similarity
  • Returns: Top 5 most similar playbooks
  • Persistence: Disk-backed (survives restarts)

Context Filtering Engine

<15ms
  • Host pattern matching, environment restrictions
  • Historical success rate comparison
  • Time-of-day constraints
  • Caching: Recent incidents cached for 5 minutes

Confidence Scoring

<10ms
>0.90 Auto · 0.70–0.90 Approve · <0.70 Escalate
  • Formula: 0.6×similarity + 0.3×success_rate + 0.1×recency
  • Thresholds configurable per organization
  • Override: Admins can force autonomous/approval per playbook

Selected Playbook Dispatched

Total: <175ms worst case (typically <165ms)

Playbook name, confidence score, execution mode, and estimated duration sent to execution orchestrator.

Beyond Basic Matching

Multi-Metric Correlation

Problem: Single metric anomaly might not warrant playbook execution.

Solution: RAG considers multiple related metrics. If disk_usage is 91% but inode_usage, write_iops, and read_latency are all normal, confidence drops and approval is required instead of autonomous execution.

Disk high but performance normal → likely expected behavior → lower confidence (0.82), require approval
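One way to express that damping is to scale confidence by how many related metrics corroborate the anomaly. This heuristic is purely illustrative (the 0.85/0.15 weighting is an assumption, not the product's formula):

```python
def correlation_adjusted(base: float, anomalous: int, related: int) -> float:
    """Damp confidence when supporting metrics look normal.

    Illustrative heuristic only: full confidence requires corroborating
    anomalies; an isolated metric keeps at most 85% of its base score."""
    support = anomalous / related if related else 1.0
    return base * (0.85 + 0.15 * support)

# disk_usage high, but inode_usage / write_iops / read_latency all normal
isolated = correlation_adjusted(0.958, anomalous=0, related=3)
# all three supporting metrics also anomalous
corroborated = correlation_adjusted(0.958, anomalous=3, related=3)
```

The isolated case drops below the 0.90 autonomy threshold, forcing approval, while the corroborated case keeps its full score.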

Time-of-Day Awareness

Problem: Same metric at different times can mean different root causes.

Solution: Embedding includes time context. CPU 85% at 2 PM matches traffic_surge_scaling (scale horizontally). CPU 85% at 3 AM matches background_job_throttle (slow down batch processing).

Same metric, different playbooks based on time of day

Incident Clustering

Problem: 5 servers with same issue = 1 root cause, not 5 separate incidents.

Solution: RAG clusters similar incidents. When prod-db-01, 02, and 03 all show disk 90%+ within 2 minutes, it identifies the shared root cause (e.g., shared NFS mount full) and executes one cluster-wide playbook.

3 servers alerted → cluster detected → 1 playbook execution, not 3
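A simplified version of that clustering groups incidents sharing a metric within a short window. The 2-minute window and the `ts`/`metric` field names are assumptions for this sketch:

```python
def cluster_incidents(incidents: list[dict], window_s: int = 120) -> list[list[dict]]:
    """Group incidents with the same metric whose timestamps fall within
    window_s of the cluster's first member. Simplified sketch."""
    clusters: list[list[dict]] = []
    for inc in sorted(incidents, key=lambda i: i["ts"]):
        for cluster in clusters:
            first = cluster[0]
            if inc["metric"] == first["metric"] and inc["ts"] - first["ts"] <= window_s:
                cluster.append(inc)
                break
        else:
            clusters.append([inc])  # no nearby cluster: start a new one
    return clusters

# Three hosts alert on the same metric within 90 seconds
incidents = [
    {"host": "prod-db-01", "metric": "disk_usage", "ts": 0},
    {"host": "prod-db-02", "metric": "disk_usage", "ts": 45},
    {"host": "prod-db-03", "metric": "disk_usage", "ts": 90},
]
clusters = cluster_incidents(incidents)
```

The three alerts collapse into one cluster, so one cluster-wide playbook runs instead of three independent executions.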

Negative Matching

Problem: Some playbooks should NEVER run on certain hosts or during certain windows.

Solution: Exclusion rules filter out dangerous playbooks before scoring. aggressive_cache_clear never runs on production, never during backup windows (00:00–06:00), and never during major outages (>5 concurrent incidents).

Excluded playbooks removed before confidence scoring
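The exclusion check runs before scoring; a sketch using the aggressive_cache_clear rules described above (the rule schema and key names here are hypothetical):

```python
from datetime import time

def is_excluded(env: str, now: time, concurrent: int, rule: dict) -> bool:
    """True if any exclusion rule fires; excluded playbooks never reach scoring."""
    if env in rule.get("never_environments", []):
        return True
    window = rule.get("blackout_window")
    if window and window[0] <= now < window[1]:
        return True
    if concurrent > rule.get("max_concurrent_incidents", float("inf")):
        return True
    return False

aggressive_cache_clear = {
    "never_environments": ["production"],
    "blackout_window": (time(0, 0), time(6, 0)),  # backup window 00:00-06:00
    "max_concurrent_incidents": 5,                # skip during major outages
}
```

Any one tripped rule removes the playbook from the candidate list, so a dangerous playbook can never win on similarity alone.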

Common Questions

What happens when no playbook matches an incident?

The confidence score falls below the 0.70 threshold, so SentienGuard escalates to a human via Slack, email, or PagerDuty. You investigate manually, then create a new playbook for future occurrences. After 3–5 successful manual resolutions, RAG has enough confidence to start running it autonomously.

Can I trigger a playbook manually?

Yes. Admins can manually trigger any playbook from the dashboard, which is useful for testing new playbooks or handling edge cases. Manual triggers are still logged in the immutable audit trail.

How do I write custom playbooks that match well?

Write a detailed description in the YAML metadata. RAG embeds the description, so natural-language clarity matters more than keywords. Good: "Clear disk space on PostgreSQL production databases by removing temporary files older than 7 days and rotating application logs. Use when disk usage exceeds 85% and database performance is unaffected." Bad: "Cleans disk".

How is RAG different from traditional ML monitoring?

Traditional ML detects anomalies (what's broken); RAG selects remediation (how to fix it). Different problems, complementary solutions. SentienGuard uses both: statistical ML for anomaly detection, RAG for playbook selection.

Can RAG invent new remediation steps?

No. RAG only selects from your existing playbook library; it cannot invent new remediation steps. This is "Retrieval-Augmented" Generation: retrieval constrains output to real playbooks. If no playbook matches (confidence <0.70), RAG escalates rather than guessing.

What does the AI cost per incident?

Approximately $0.0001 per incident (negligible). For 10,000 incidents/month, that's about $1/month in embedding cost. The platform price ($4/node) includes AI service fees, so there are no surprise bills.

Try RAG Intelligence

  1. Deploy SentienGuard agents on 3 nodes (free tier).
  2. Import the pre-built playbook library (50+ playbooks, already embedded).
  3. Trigger a test incident (fill a disk to 90%).
  4. Watch RAG select a playbook in under 165ms.
  5. Review the confidence score and execution results.

Free tier: 3 nodes, unlimited playbooks, full audit logs, no credit card required.