```yaml
# Brittle threshold example
alert: HighDiskUsage
condition: disk_usage > 85%
action: page_engineer
```

Anomaly Detection
Dynamic Baselines Beat Static Thresholds Every Time
Static thresholds ignore context and break as you grow. Dynamic baselines learn your infrastructure's normal behavior using 7-day rolling averages with time-of-day patterns: detect real anomalies, cut false positives to under 10%, and adapt automatically as you scale.
Why Hard-Coded Thresholds Break
Traditional monitoring uses static thresholds: “alert if disk >85%.” This optimizes for simplicity—easy to configure—but sacrifices accuracy. Static thresholds work for stable, unchanging infrastructure, which doesn't exist. Real infrastructure changes over time, varies by time of day, differs by context, and has seasonal patterns. Static thresholds can't handle any of this.
Failure Mode: Context Blindness
Static threshold: "Alert if disk >85%." But 85% disk on a production database at 2 PM is normal high-traffic volume. On the same database at 3 AM, 78% disk is an anomaly—backups not clearing temp files. On a dev server, 90% is expected. On a log aggregator, 70% is critical because log rotation failed. Same threshold, different contexts, wrong results every time. 85% isn't universally "high"—it depends on server type, time of day, and historical patterns.
Failure Mode: Growth Breaks Thresholds
Month 1: 100 servers, average CPU 40%. Month 6: 500 servers, average CPU 60% (more load, optimized code). Month 12: 1,000 servers, average CPU 75% (traffic growth). Static threshold at 80% works in Month 1 (40% average, 80% = rare spike) but fires constantly in Month 12 (75% average, 80% = normal variance). You tune thresholds for current state, then infrastructure grows, and thresholds become meaningless.
Failure Mode: Time-of-Day Ignorance
E-commerce sites peak 12 PM to 9 PM and go quiet 3 AM to 6 AM. Batch processing is quiet during the day and heavy at night. B2B SaaS peaks 9 AM to 5 PM and idles on weekends. Static thresholds can't encode time-of-day context. Disk at 87% at 2 PM during peak traffic is normal, but disk at 78% at 3 AM when baseline is 65% means log rotation failed—and the threshold won't catch it.
Failure Mode: Seasonal Patterns Ignored
Black Friday: 10× normal traffic. Summer slowdown: 50% normal traffic. New feature launch: gradual growth over 3 months. Marketing campaign: sudden 3× spike for 2 weeks. A static threshold of 10,000 requests/sec is wrong for Black Friday (normal is 50,000), wrong for summer (2,500 average, threshold never triggers), and only right for a brief window. Threshold is wrong 80% of the year.
2:00 PM - prod-db-03 disk usage: 87%
Static threshold: 85% exceeded → ALERT
Engineer paged: Investigates, finds normal traffic spike
Result: False positive, wasted 45 minutes
vs.
3:00 AM - prod-db-03 disk usage: 78%
Static threshold: Not exceeded → NO ALERT
Reality: Baseline is 65% at 3 AM, 78% is 4.3σ deviation
Result: Missed real anomaly, incident discovered hours later

January: Set threshold at 80% CPU (works)
March: False positives increase, raise to 85%
June: Still too many alerts, raise to 90%
September: Missing real incidents, lower to 87%
December: Alert fatigue, engineers ignore all CPU alerts
Result: Threshold tuning is a full-time job

Threshold: disk_usage > 85%
2 PM (peak traffic):
- Disk 87% = normal (lots of temp files from requests)
- Alert fires = FALSE POSITIVE
3 AM (off-peak):
- Disk 78% = anomaly (should be 65%, log rotation failed)
- Alert doesn't fire = MISSED INCIDENT

Threshold: request_rate > 10,000/sec
Normal day: 5,000/sec average, 10,000/sec = real spike
Black Friday: 50,000/sec average, 10,000/sec = impossibly low
Summer: 2,500/sec average, 10,000/sec = never triggers (too high)
Result: Threshold is wrong 80% of the year

The Core Problem
Static thresholds produce a 60-70% false positive rate and miss 20-30% of real incidents. You end up with two bad choices: set thresholds tight (alert fatigue from false positives, engineers ignore alerts) or set thresholds loose (miss real incidents, discover failures hours later). Neither works. You need dynamic baselines.
60-70%
False positive rate
20-30%
Missed real incidents
Week 12
Alert fatigue sets in
Statistical Learning Replaces Hard-Coded Thresholds
Dynamic baselines learn what “normal” looks like for each metric, on each host, at each time of day. Instead of “disk >85%”, you get “disk is 4.8 standard deviations above expected for this host at this time.”
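The deviation is a plain z-score. A minimal sketch, with illustrative baseline numbers:

```python
# Deviation measured as a z-score: how many standard deviations the
# current value sits above the learned mean. Numbers are illustrative.
def deviation_sigma(current: float, mean: float, stddev: float) -> float:
    return (current - mean) / stddev

# A host whose 2 PM disk baseline is 68.2% +/- 2.1%, now reading 91.4%:
print(round(deviation_sigma(91.4, 68.2, 2.1), 1))  # 11.0 -- far outside the ±2σ normal range
```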
Day 1-7: Initial Learning
7 days
System collects metrics every 30 seconds—20,160 data points per metric over 7 days. Calculates mean, standard deviation, and time-of-day patterns. Different baselines computed for each hour of the day, each host, each metric.
Day 8+: Anomaly Detection
Real-time
Every new metric is compared against the baseline for that specific hour. Deviation measured in standard deviations, not arbitrary percentages. 9.1σ deviation on disk usage at 2 PM = CRITICAL anomaly. 0.6σ deviation = normal variance, no action needed.
Continuous: Rolling Window
Daily update
Baseline updates every day: drop oldest day, add newest day, recalculate mean and standard deviation. Gradual infrastructure changes absorbed into the baseline automatically. Sudden spikes don't pollute the baseline because they're one day out of seven.
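A minimal sketch of the rolling update, assuming one aggregated value per day (the real system keeps per-hour, per-host series):

```python
from collections import deque
from statistics import mean, stdev

# 7-day rolling baseline: appending a new day evicts the oldest one,
# then mean and standard deviation are recomputed over the window.
window = deque([68.1, 68.4, 67.9, 68.7, 68.2, 68.5, 68.3], maxlen=7)

def update_baseline(window, new_day_value):
    window.append(new_day_value)  # maxlen=7 drops the oldest day
    return mean(window), stdev(window)

m, s = update_baseline(window, 69.0)
print(len(window), round(m, 2))  # still 7 days of data; mean shifts slightly
```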
System collects metrics every 30 seconds:
- prod-db-03, disk_usage: 68.2%, 69.1%, 67.8%, 68.9%, ...
- Stores 7 days of data points (20,160 data points per metric)
Calculates baseline:
- Mean: 68.4%
- Standard deviation: 2.1%
- Time-of-day patterns: Higher during business hours (72%), lower at night (65%)
Result: Baseline established for prod-db-03 disk_usage

New metric arrives: disk_usage = 91.4% at 2:47 PM
Statistical analysis:
- Expected at 2:47 PM: 72.3% (business hours baseline)
- Standard deviation: 2.1%
- Deviation: (91.4 - 72.3) / 2.1 = 9.1 standard deviations
- Threshold: >2σ = anomaly
Decision: 9.1σ = CRITICAL ANOMALY (far beyond normal variance)
Action: Trigger playbook selection

Month 1 baseline: CPU 40% ± 5%
Month 3 baseline: CPU 55% ± 6% (gradual growth, baseline adapts)
Month 6 baseline: CPU 65% ± 7% (continued growth, still adapting)
Static threshold would break. Dynamic baseline just works.

```
For each metric, at each hour of the day:
    baseline_mean   = average(last_7_days_same_hour)
    baseline_stddev = stddev(last_7_days_same_hour)
    normal_range    = baseline_mean ± 2σ

    if current_value outside normal_range:
        deviation = (current_value - baseline_mean) / baseline_stddev
        if deviation > 2: WARNING
        if deviation > 3: HIGH
        if deviation > 4: CRITICAL
```

Why This Works
Context-aware
Different baselines for prod vs. dev, database vs. web server
Time-aware
Different baselines for 2 PM vs. 3 AM
Adaptive
Baselines shift as infrastructure grows or changes
Statistical
Uses standard deviations, not arbitrary percentages
Self-tuning
No manual threshold updates needed
Why 75% Disk at 2 PM ≠ 75% Disk at 3 AM
Same metric value means different things at different times. SentienGuard learns time-of-day patterns automatically so that peak-hour traffic doesn't trigger false alarms and off-hour anomalies don't go undetected.
Disk Usage Over 24 Hours (prod-web-01)
100% ┤
│ ╭──────╮ ← False Positive
85% ┼─────────── █──────█ ────────────── (threshold)
│ ╭───╯ ╰───╮
70% ┤ ╭───╯ ╰───╮
│╭───╯ ╰───╮
50% ┼╯ ╰───
└┬────┬────┬────┬────┬────┬────┬───
6AM 9AM 12PM 3PM 6PM 9PM 12AM
Analysis:
- 2 PM peak: 87% disk (normal traffic spike)
- Threshold 85% exceeded → ALERT
- Engineer investigates: False alarm
- Result: Wasted 30 minutes

Disk Usage with Time-of-Day Baseline
100% ┤
│ ╭──────╮
85% ┤ ╭───█──────█───╮ ← Baseline adjusts
│ ╭───╯──╯ ╰──╰───╮ ← Normal range
70% ┼────╯ ╰────
│╭──╯ ╰──╮
50% ┼╯ ╰─────
└┬────┬────┬────┬────┬────┬────┬───
6AM 9AM 12PM 3PM 6PM 9PM 12AM
Analysis:
- 2 PM peak: 87% disk
- Expected: 85% ± 3% (2 PM business hours baseline)
- Deviation: 0.67σ (well within normal)
- Result: No alert, no wasted time

The Pattern Library
Pattern 1: Business Hours Peak (B2B SaaS)
Baseline by hour (CPU usage):
- 12 AM - 6 AM: 20% ± 5% (overnight batch jobs only)
- 6 AM - 9 AM: 40% ± 8% (morning login surge)
- 9 AM - 5 PM: 75% ± 10% (business hours traffic)
- 5 PM - 9 PM: 55% ± 12% (evening taper)
- 9 PM - 12 AM: 30% ± 6% (minimal traffic)
Same server, different baselines every hour

Pattern 2: Batch Processing at Night
Baseline by hour (CPU usage):
- 9 AM - 9 PM: 30% ± 5% (low during day)
- 9 PM - 12 AM: 45% ± 8% (starting batch jobs)
- 12 AM - 6 AM: 95% ± 3% (full batch processing)
- 6 AM - 9 AM: 60% ± 10% (finishing jobs)
95% CPU at 3 AM = normal (batch processing)
95% CPU at 3 PM = critical anomaly (something's wrong)

Pattern 3: E-Commerce Peak Hours
Baseline by hour (request rate):
- 12 AM - 9 AM: 500/sec ± 100 (overnight, low traffic)
- 9 AM - 12 PM: 2,000/sec ± 300 (morning shopping)
- 12 PM - 3 PM: 5,000/sec ± 800 (lunch break peak)
- 3 PM - 6 PM: 3,000/sec ± 500 (afternoon taper)
- 6 PM - 9 PM: 4,000/sec ± 700 (evening shopping)
- 9 PM - 12 AM: 1,500/sec ± 250 (late night)
4,000 requests/sec at 8 PM = normal
4,000 requests/sec at 4 AM = 8× baseline = critical

```python
# Pseudocode: Time-of-day baseline calculation
def calculate_baseline(metric, host, current_hour):
    # Get last 7 days of data for this hour
    historical_data = get_metric_data(
        metric=metric,
        host=host,
        hour=current_hour,  # e.g., 14 (2 PM)
        last_n_days=7,
    )
    # Calculate statistics
    baseline_mean = mean(historical_data)
    baseline_stddev = stddev(historical_data)
    # Define normal range: mean ± 2 standard deviations
    normal_min = baseline_mean - (2 * baseline_stddev)
    normal_max = baseline_mean + (2 * baseline_stddev)
    return {
        "mean": baseline_mean,
        "stddev": baseline_stddev,
        "normal_range": [normal_min, normal_max],
    }

# Example output for prod-db-03, disk_usage, 2 PM:
# { "mean": 72.3, "stddev": 2.1, "normal_range": [68.1, 76.5] }
```

```python
# Pseudocode: Anomaly detection
def detect_anomaly(current_value, baseline):
    deviation = (current_value - baseline.mean) / baseline.stddev
    if deviation > 4.0:
        return "CRITICAL"  # >4σ
    elif deviation > 3.0:
        return "HIGH"      # 3-4σ
    elif deviation > 2.0:
        return "WARNING"   # 2-3σ
    else:
        return "NORMAL"    # <2σ

# Example: disk_usage = 91.4% at 2 PM
# baseline.mean = 72.3%, baseline.stddev = 2.1%
# (91.4 - 72.3) / 2.1 = 9.1σ → "CRITICAL"
```

Baselines Adapt as Infrastructure Evolves
Rolling 7-day windows prevent baseline staleness. Gradual infrastructure growth is absorbed automatically. Sudden spikes are detected correctly. One-time events like Black Friday are forgotten after 7 days.
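The "spike forgotten after 7 days" behavior falls directly out of the window arithmetic. A quick sketch with illustrative Black Friday-style numbers:

```python
from statistics import mean

# One 10x spike day inflates the 7-day mean while it is in the window,
# then drops out entirely once seven normal days have passed.
normal, spike = 5_000, 50_000
spike_week = [normal] * 6 + [spike]   # window that still contains the spike day
week_after = [normal] * 7             # seven days later: spike evicted

print(round(mean(spike_week)))  # inflated mean while the spike is in-window
print(mean(week_after))         # back to 5000 once the spike ages out
```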
Traffic Growth Over 6 Months
Month 1
5,000 req/sec ± 800
Reality: 5,200 avg
Normal (1σ)
Month 2
6,500 req/sec ± 1,000
Reality: 6,800 avg
Normal (1σ)
Month 4
9,200 req/sec ± 1,400
Reality: 9,500 avg
Normal (1σ)
Month 6
12,000 req/sec ± 1,800
Reality: 12,300 avg
Normal (1σ)
Static threshold would have broken by Month 2.
Week 1 baseline: CPU 40% (calculated from Days 1-7)
Week 52 baseline: Still 40% (never updated)
Reality: Infrastructure grew, optimized, changed
Actual CPU Week 52: 70% average
Result: Every metric triggers as anomaly (70% vs 40% baseline = 14σ!)

Day 1-7: Baseline calculated from Days 1-7
Day 8: Baseline calculated from Days 2-8 (drop Day 1, add Day 8)
Day 9: Baseline calculated from Days 3-9 (drop Day 2, add Day 9)
...
Day 365: Baseline calculated from Days 359-365
Baseline continuously adapts to infrastructure changes

Gradual vs. Sudden Changes
Day 1: CPU 40%
Day 2: CPU 41%
Day 3: CPU 42%
...
Day 90: CPU 65%
Change: +0.28% per day (gradual)
Baseline: Tracks smoothly (40% → 42% → 45% → ... → 65%)
Result: No false alarms, baseline adapts

Day 1-89: CPU 40% ± 3%
Day 90: CPU 85% (sudden spike)
Change: +45% in one day (sudden)
Baseline: Still ~40% (7-day window includes Days 83-89, all ~40%)
Deviation: (85 - 40) / 3 = 15σ
Result: CRITICAL anomaly detected (correct!)

Nov 20-23: Normal traffic (5,000 req/sec baseline)
Nov 24 (Black Friday): 50,000 req/sec (10× spike)
Nov 25-30: Back to normal (5,000 req/sec)
Baseline during Black Friday week:
Days: Nov 18, 19, 20, 21, 22, 23, 24
Values: 5K, 5K, 5K, 5K, 5K, 5K, 50K
Mean: 11.4K (inflated by one spike day)
Baseline 7 days later (Dec 1):
Days: Nov 25, 26, 27, 28, 29, 30, Dec 1
Values: 5K, 5K, 5K, 5K, 5K, 5K, 5K
Mean: 5K (back to normal, spike forgotten)
Result: Black Friday spike doesn't pollute baseline forever

```python
# Weighted calculation (more recent = more important)
weights = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]  # Day 1 (oldest) → Day 7 (newest)
weighted_mean = sum(values[i] * weights[i] for i in range(7)) / sum(weights)

# Example:
values = [40, 41, 42, 43, 44, 45, 70]  # CPU % over 7 days
simple_mean   = 46.4  # equal weight: sum(values) / 7
weighted_mean = 49.8  # recent spike (Day 7 = 70%) weighted higher
```

Result: Baseline adapts faster to recent changes

How Dynamic Baselines Eliminate Alert Fatigue
Three techniques reduce false positives from 60-70% to under 10%: persistence filtering, known event exclusions, and multi-metric correlation.
Static thresholds — 100 alerts per week:
- 60-70 false positives (engineer investigates, finds nothing wrong)
- 20-30 real incidents (require action)
- 5-10 missed incidents (threshold too loose)
Engineer response:
Week 1: Investigates all 100 alerts diligently
Week 4: Starts ignoring "probably false" alerts
Week 8: Ignores 80% of alerts (alert fatigue sets in)
Week 12: Misses critical incident (lost in noise)
Result: Alerts become useless

Dynamic baselines — 100 incidents detected per week:
- 5-10 false positives (5-10% rate, down from 60-70%)
- 85-90 real anomalies (true positive rate: 90%)
- 2-3 missed incidents (sensitivity: 97%)
Engineer response:
Every alert is probably real
Engineers trust the system
Alerts get investigated promptly
Real incidents caught early
Result: Alerts remain useful

Persistence Filtering
Problem
Single data point spikes cause false alarms.
Solution
Anomaly must persist for 2+ minutes (4+ consecutive samples). Transient spikes—disk write bursts, CPU microbursts—ignored automatically. Real sustained anomalies always detected.
Known Event Exclusions
Problem
Scheduled maintenance triggers false alarms every time.
Solution
Configure known events (cron schedules) that suppress alerts during maintenance windows. Database backups every Sunday 2 AM, monthly batch jobs—all excluded automatically. Real anomalies outside maintenance windows still detected.
Multi-Metric Correlation
Problem
Single metric anomaly might be harmless.
Solution
Cross-check correlated metrics before escalating. Disk at 91% alone = lower confidence. Disk at 91% + rising error rate + high IOPS = high confidence. Correlated anomalies get autonomous execution; isolated anomalies require approval.
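One way such correlation weighting might be scored (a sketch; the function name, base confidence, and boost rule are hypothetical illustrations, not SentienGuard's actual API):

```python
# Hypothetical confidence scoring: start from a base confidence and
# apply a boost only when the weighted share of correlated metrics
# that also look anomalous exceeds half the total weight.
def correlation_confidence(base, correlations, boost=0.2):
    # correlations: list of (matched, weight) pairs
    total = sum(w for _, w in correlations)
    matched = sum(w for m, w in correlations if m)
    return min(1.0, base + boost) if matched / total > 0.5 else base

# Disk at 91% alone: no correlated anomalies -> confidence stays at 0.75
alone = correlation_confidence(0.75, [(False, 0.3), (False, 0.2), (False, 0.5)])
# Disk at 91% + high IOPS + application errors -> boosted confidence
together = correlation_confidence(0.75, [(False, 0.3), (True, 0.2), (True, 0.5)])
print(alone, round(together, 2))
```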
Disk usage samples (every 30s):
72%, 73%, 91%, 74%, 72%, 73%
↑
One spike (temporary write, then deleted)
Static threshold: Triggers immediately on 91%
Dynamic baseline with persistence:
Sample 1: 91% (anomaly detected)
Sample 2: 74% (back to normal)
Duration: 30 seconds (< 2 minutes required)
Action: Ignore (transient spike)
Sustained anomaly:
Sample 1-6: 91%, 92%, 90%, 91%, 93%, 91%
Duration: 3 minutes (> 2 minutes required)
Action: Trigger playbook (real incident)

```python
def detect_persistent_anomaly(metric_history,
                              threshold_sigma=2.0,
                              duration_seconds=120):
    anomaly_start = None
    for sample in metric_history:
        if sample.deviation > threshold_sigma:
            if anomaly_start is None:
                anomaly_start = sample.timestamp
            else:
                duration = sample.timestamp - anomaly_start
                if duration >= duration_seconds:
                    return True  # Persistent anomaly
        else:
            anomaly_start = None  # Reset if back to normal
    return False  # No persistent anomaly
```

```yaml
# Known events configuration
known_events:
  - name: weekly_database_backup
    schedule: "0 2 * * 0"   # Cron: Every Sunday 2 AM
    duration: 120           # minutes
    suppress_alerts:
      - disk_usage
      - cpu_usage
      - network_throughput
    hosts:
      - "*.db.*"            # All database servers
```

```python
def should_suppress_alert(metric, host, timestamp):
    for event in known_events:
        if event.matches(host) and event.is_active(timestamp):
            if metric in event.suppress_alerts:
                return True  # Suppress, known event
    return False  # Don't suppress
```

Primary anomaly: disk_usage = 91%
Check correlated metrics:
- inode_usage: 45% (normal, not exhausted)
- disk_write_iops: 1,200 (normal, not thrashing)
- disk_read_latency: 3ms (normal, no performance degradation)
- application_errors: 0 (normal, no failures)
- tcp_connections: 150 (normal, services healthy)
Interpretation:
Disk space is high, but performance is normal.
Likely cause: Legitimate large file write.
Action: Lower confidence (0.75 → requires approval instead of autonomous execution)

```yaml
# Multi-metric correlation
correlation_rules:
  disk_usage_critical:
    primary: disk_usage > baseline + 4σ
    correlations:
      - metric: inode_usage
        threshold: "> 90%"
        weight: 0.3
      - metric: disk_write_iops
        threshold: "> baseline + 3σ"
        weight: 0.2
      - metric: application_errors
        threshold: "> 0"
        weight: 0.5
    confidence_boost: 0.2  # If correlations match, increase confidence
```

From Warning to Critical Based on Deviation
Four severity tiers determined by statistical deviation. Higher deviation = rarer event = faster response. Critical anomalies (>4σ) trigger autonomous resolution.
Normal
0-2σ · ~95% of samples
Example: CPU 68% (baseline 65% ± 5%) → 0.6σ deviation
Response
No action needed. Expected variance within normal distribution.
Warning
2-3σ · ~2-5% of samples
Example: Memory 9.3 GB (baseline 8.2 GB ± 0.5 GB) → 2.2σ deviation
Response
Incident logged in dashboard. No playbook execution yet. If persists >10 minutes → escalate to HIGH.
High
3-4σ · ~0.1-0.5% of samples
Example: Disk 74% (baseline 68% ± 2%) → 3.0σ deviation
Response
RAG selects playbook. Approval required via Slack notification. Human confirms before execution. Typical for first-time incidents.
Critical
>4σ · <0.01% of samples
Example: Disk 91% (baseline 68% ± 2%) → 11.5σ deviation
Response
RAG selects playbook. If confidence >0.90: execute autonomously. If 0.70-0.90: fast-track Slack approval. If <0.70: escalate to human.
| Severity | Deviation | Frequency | Response | Approval |
|---|---|---|---|---|
| Normal | 0-2σ | 95% | None | N/A |
| Warning | 2-3σ | 2-5% | Log only | N/A |
| High | 3-4σ | 0.1-0.5% | Playbook (approval) | Required |
| Critical | >4σ | <0.01% | Playbook (autonomous) | Optional |
```yaml
# Customizable per organization
anomaly_thresholds:
  warning: 2.0    # standard deviations
  high: 3.0
  critical: 4.0
persistence_required:
  warning: 300    # 5 minutes
  high: 120       # 2 minutes
  critical: 60    # 1 minute (faster response for critical)
```

When Baselines Need Manual Intervention
Automatic rolling windows handle most changes. For major infrastructure shifts, two additional strategies provide control: manual reset and weighted baselines.
Automatic Rolling Window (Default)
7-day rolling average, updated daily. No manual intervention. Adapts to gradual changes automatically.
When It Works
Steady traffic growth, infrastructure optimization, seasonal patterns, normal operations.
Limitations
Major architecture change (microservices migration), platform migration (AWS → GCP), traffic pattern shift (B2B → B2C pivot).
Manual Baseline Reset
Dashboard → Anomaly Detection → Reset Baseline. Reset for specific host, host pattern, metric, or all. Starts fresh 7-day learning.
When It Works
Major infrastructure changes, platform migrations, architecture overhauls, traffic pattern shifts.
Limitations
7-day blind period without anomaly detection. Mitigate with temporary static thresholds.
Weighted Baseline (Prefer Recent Data)
Weight recent days higher than older days (2.0× for today, 0.5× for 7 days ago). Baseline adapts faster. Exponential decay option available.
When It Works
Rapid infrastructure changes, frequent deployments, A/B testing affecting metrics.
Limitations
Reduces statistical stability. Recent outliers have more influence.
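The exponential-decay variant mentioned above can be sketched as follows (decay factor 0.85, so each day back counts 0.85× as much; the function is an illustration, not the product's implementation):

```python
# Exponentially decayed mean: the newest day gets weight 1.0, and each
# older day is multiplied by the decay factor once more. Illustrative only.
def decayed_mean(values, decay=0.85):
    # values ordered oldest -> newest
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

cpu = [40, 42, 44, 46, 48, 50, 65]   # last 7 days of CPU %, oldest first
print(round(decayed_mean(cpu), 1))   # pulled toward the recent 65% reading
```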
Before migration: Monolith, CPU 40% baseline
After migration: Microservices, CPU 70% baseline
Problem: Rolling window takes 7 days to adapt
Result: 7 days of false alarms (70% vs 40% = constant anomalies)
Solution: Manual baseline reset
Dashboard → Anomaly Detection → Reset Baseline
→ Discards old baseline, starts fresh 7-day learning

```yaml
# Temporary static thresholds during reset
temporary_thresholds:
  enabled: true
  duration: 7 days
  thresholds:
    cpu_usage: "> 90%"
    memory_usage: "> 95%"
    disk_usage: "> 85%"
```

CPU usage last 7 days: [40%, 42%, 44%, 46%, 48%, 50%, 65%]
Standard average: 47.9%
Weighted average: 50.9% (Day 7's 65% weighted higher)
Current: 68%
Deviation (standard): (68 - 47.9) / stddev = 4.2σ → CRITICAL
Deviation (weighted): (68 - 50.9) / stddev = 3.6σ → HIGH
Result: Weighted baseline reduces false alarm severity

```yaml
# Enable weighted baseline
baseline_calculation:
  method: weighted_rolling_average
  weights: [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]  # Day 1 (oldest) → Day 7 (newest)

# Or use exponential decay
baseline_calculation:
  method: exponential_decay
  decay_factor: 0.85  # Each day back weighted 0.85× the day after it
```

From Anomaly Detection to Autonomous Resolution
Five-stage pipeline: detect anomaly, enrich context, select playbook, execute, verify. Total latency: detection (100ms) + enrichment (50ms) + RAG selection (165ms) + execution (87s) + verification (10s) ≈ 97 seconds from anomaly to resolution.
Stage 1: Anomaly Detected
<100ms
Metric arrives: prod-db-03, disk_usage = 91.4%. Baseline: 68.2% ± 2.1%. Deviation: (91.4 - 68.2) / 2.1 = 11.0σ. Severity: CRITICAL (>4σ).
Stage 2: Context Enrichment
<50ms
Anomaly enriched with host metadata (prod-db-03.us-east-1, production, postgresql), time context (2:47 PM, business hours), and correlated metrics (inode 45% normal, write IOPS normal, errors = 0). Interpretation: disk full, services still healthy.
Stage 3: RAG Playbook Selection
~165ms
Enriched context → RAG → Top 3 playbooks: disk_cleanup_prod_db (0.94), disk_cleanup_general (0.87), log_rotation_postgres (0.82). Selected: disk_cleanup_prod_db (highest confidence).
Stage 4: Execution Decision
87 seconds
Confidence 0.94 exceeds the 0.90 threshold. Severity: CRITICAL. Decision: execute autonomously. Playbook dispatched to the agent on prod-db-03. Steps: clear temp files, rotate logs, verify disk returns to normal range.
Stage 5: Verification & Closure
~10s
Execution complete. Post-execution: disk_usage = 72.1% (down from 91.4%). Deviation: 1.9σ (back within normal range). Incident closed: autonomous resolution successful.
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Anomaly Detection │
│ Disk 91.4% (11.0σ above baseline) → CRITICAL │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Context Enrichment │
│ + Host type, environment, time, correlated metrics │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 3: RAG Playbook Selection │
│ disk_cleanup_prod_db selected (0.94 confidence) │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 4: Autonomous Execution │
│ 0.94 > 0.90 → Execute without approval │
│ Duration: 87 seconds │
└──────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Stage 5: Verification │
│ Disk 72.1% (1.9σ, back to normal) → Incident closed │
└─────────────────────────────────────────────────────────┘

End-to-End Latency Breakdown
- Detection: 100ms
- Enrichment: 50ms
- RAG Selection: 165ms
- Execution: 87s
- Verification: 10s

Total: ~97 seconds from anomaly to resolution
Common Questions
How long before dynamic baselines are active?
7 days of data required for statistical confidence. During the first 7 days, SentienGuard uses conservative static thresholds (CPU >95%, Disk >90%, Memory >95%) to avoid false alarms. After 7 days, dynamic baselines activate automatically.

What happens when traffic patterns change?
Rolling 7-day window adapts automatically to gradual changes. For sudden shifts (B2B → B2C pivot), manually reset the baseline via dashboard. Takes 7 days to relearn the new normal. During reset, use temporary static thresholds to avoid a blind period.

Can I tune the anomaly sensitivity?
Yes. Dashboard → Settings → Anomaly Detection → Threshold. Options: 1.5σ (sensitive, more alerts), 2.0σ (default, balanced), 2.5σ (conservative, fewer alerts), 3.0σ (only extreme anomalies). Applies organization-wide or per-host-pattern.

What if the baseline learned from abnormal data?
Reset the baseline manually. Example: Baseline learned during a DDoS attack (abnormally high traffic) → baseline inflated → real incidents missed. Solution: Dashboard → Reset Baseline → Fresh 7-day learning from normal operations.

Is this machine learning?
SentienGuard uses statistical learning (mean, standard deviation, rolling windows), not deep learning neural networks. Benefits: explainable (know why an alert fired), fast (<100ms), low resource usage, no training data required. Trade-off: simpler models can't detect complex multi-variate patterns that ML might catch.

Can I exclude specific dates from the baseline?
Yes. Dashboard → Anomaly Detection → Exclusions → Add Time Range. Specify a date range to exclude from baseline calculation. Those days' data are discarded and don't influence the baseline. Useful for one-time events like Black Friday or product launches that shouldn't define "normal."
See Dynamic Baselines in Action
Deploy agents, watch 7-day baseline learning, trigger test incident, see severity classification, observe autonomous resolution.
Free tier: 3 nodes, 7-day baseline learning, full anomaly detection, no credit card.