SentienGuard

Anomaly Detection

Dynamic Baselines Beat Static Thresholds Every Time

Static thresholds ignore context and break during growth. Dynamic baselines learn your infrastructure's normal behavior using 7-day rolling averages with time-of-day patterns. Detect real anomalies, eliminate false positives, adapt automatically as you scale.

7 days: Baseline learning window (rolling average with decay)
2-4σ: Anomaly threshold range (standard deviations from baseline)
<100ms: Detection latency (real-time statistical analysis)

Why Hard-Coded Thresholds Break

Traditional monitoring uses static thresholds: “alert if disk >85%.” This optimizes for simplicity—easy to configure—but sacrifices accuracy. Static thresholds work for stable, unchanging infrastructure, which doesn't exist. Real infrastructure changes over time, varies by time of day, differs by context, and has seasonal patterns. Static thresholds can't handle any of this.

Brittle Threshold Example
# Brittle threshold example
alert: HighDiskUsage
condition: disk_usage > 85%
action: page_engineer

Failure Mode: Context Blindness

Static threshold: "Alert if disk >85%." But 85% disk on a production database at 2 PM is normal high-traffic volume. On the same database at 3 AM, 75% disk is an anomaly—backups not clearing temp files. On a dev server, 90% is expected. On a log aggregator, 70% is critical because log rotation failed. Same threshold, different contexts, wrong results every time. 85% isn't universally "high"—it depends on server type, time of day, and historical patterns.

Failure Mode: Growth Breaks Thresholds

Month 1: 100 servers, average CPU 40%. Month 6: 500 servers, average CPU 60% (more load, optimized code). Month 12: 1,000 servers, average CPU 75% (traffic growth). Static threshold at 80% works in Month 1 (40% average, 80% = rare spike) but fires constantly in Month 12 (75% average, 80% = normal variance). You tune thresholds for current state, then infrastructure grows, and thresholds become meaningless.

Failure Mode: Time-of-Day Ignorance

E-commerce sites peak 12 PM to 9 PM and go quiet 3 AM to 6 AM. Batch processing is quiet during the day and heavy at night. B2B SaaS peaks 9 AM to 5 PM and idles on weekends. Static thresholds can't encode time-of-day context. Disk at 87% at 2 PM during peak traffic is normal, but disk at 78% at 3 AM when baseline is 65% means log rotation failed—and the threshold won't catch it.

Failure Mode: Seasonal Patterns Ignored

Black Friday: 10× normal traffic. Summer slowdown: 50% normal traffic. New feature launch: gradual growth over 3 months. Marketing campaign: sudden 3× spike for 2 weeks. A static threshold of 10,000 requests/sec is wrong for Black Friday (normal is 50,000), wrong for summer (2,500 average, threshold never triggers), and only right for a brief window. Threshold is wrong 80% of the year.

Context Blindness: Real Incident
2:00 PM - prod-db-03 disk usage: 87%
Static threshold: 85% exceeded → ALERT
Engineer paged: Investigates, finds normal traffic spike
Result: False positive, wasted 45 minutes

vs.

3:00 AM - prod-db-03 disk usage: 78%
Static threshold: Not exceeded → NO ALERT
Reality: Baseline is 65% at 3 AM, 78% is 4.3σ deviation
Result: Missed real anomaly, incident discovered hours later
The Re-Tuning Treadmill
January:  Set threshold at 80% CPU (works)
March:    False positives increase, raise to 85%
June:     Still too many alerts, raise to 90%
September: Missing real incidents, lower to 87%
December: Alert fatigue, engineers ignore all CPU alerts
Result: Threshold tuning is a full-time job
Time-of-Day Ignorance
Threshold: disk_usage > 85%

2 PM (peak traffic):
  - Disk 87% = normal (lots of temp files from requests)
  - Alert fires = FALSE POSITIVE

3 AM (off-peak):
  - Disk 78% = anomaly (should be 65%, log rotation failed)
  - Alert doesn't fire = MISSED INCIDENT
Seasonal Blindness
Threshold: request_rate > 10,000/sec

Normal day: 5,000/sec average, 10,000/sec = real spike
Black Friday: 50,000/sec average, 10,000/sec = impossibly low
Summer: 2,500/sec average, 10,000/sec = never triggers (too high)

Result: Threshold is wrong 80% of the year

The Core Problem

Static thresholds produce a 60-70% false positive rate and miss 20-30% of real incidents. You end up with two bad choices: set thresholds tight (alert fatigue from false positives, engineers ignore alerts) or set thresholds loose (miss real incidents, discover failures hours later). Neither works. You need dynamic baselines.

60-70%

False positive rate

20-30%

Missed real incidents

Week 12

Alert fatigue sets in

Statistical Learning Replaces Hard-Coded Thresholds

Dynamic baselines learn what “normal” looks like for each metric, on each host, at each time of day. Instead of “disk >85%”, you get “disk is 4.8 standard deviations above expected for this host at this time.”

1

Day 1-7: Initial Learning

7 days

System collects metrics every 30 seconds—20,160 data points per metric over 7 days. Calculates mean, standard deviation, and time-of-day patterns. Different baselines computed for each hour of the day, each host, each metric.

2

Day 8+: Anomaly Detection

Real-time

Every new metric is compared against the baseline for that specific hour. Deviation measured in standard deviations, not arbitrary percentages. 9.1σ deviation on disk usage at 2 PM = CRITICAL anomaly. 0.6σ deviation = normal variance, no action needed.

3

Continuous: Rolling Window

Daily update

Baseline updates every day: drop oldest day, add newest day, recalculate mean and standard deviation. Gradual infrastructure changes absorbed into the baseline automatically. Sudden spikes don't pollute the baseline because they're one day out of seven.

Day 1-7: Initial Learning
System collects metrics every 30 seconds:
- prod-db-03, disk_usage: 68.2%, 69.1%, 67.8%, 68.9%, ...
- Stores 7 days of data points (20,160 data points per metric)

Calculates baseline:
- Mean: 68.4%
- Standard deviation: 2.1%
- Time-of-day patterns: Higher during business hours (72%), lower at night (65%)

Result: Baseline established for prod-db-03 disk_usage
Day 8+: Anomaly Detection
New metric arrives: disk_usage = 91.4% at 2:47 PM

Statistical analysis:
- Expected at 2:47 PM: 72.3% (business hours baseline)
- Standard deviation: 2.1%
- Deviation: (91.4 - 72.3) / 2.1 = 9.1 standard deviations
- Threshold: >2σ = anomaly

Decision: 9.1σ = CRITICAL ANOMALY (far beyond normal variance)
Action: Trigger playbook selection
Infrastructure Growth Adaptation
Month 1 baseline: CPU 40% ± 5%
Month 3 baseline: CPU 55% ± 6% (gradual growth, baseline adapts)
Month 6 baseline: CPU 65% ± 7% (continued growth, still adapting)

Static threshold would break. Dynamic baseline just works.
The Math
For each metric, at each time-of-day:

baseline_mean = average(last_7_days_same_hour)
baseline_stddev = stddev(last_7_days_same_hour)
normal_range = baseline_mean ± 2σ

If current_value outside normal_range:
  deviation = (current_value - baseline_mean) / baseline_stddev

  if |deviation| > 2: WARNING
  if |deviation| > 3: HIGH
  if |deviation| > 4: CRITICAL
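The math above can be sketched as runnable Python using only the standard library (function and variable names here are illustrative, not SentienGuard's API):

```python
from statistics import mean, stdev

def classify(current_value, history):
    """Classify a sample against a per-hour baseline.

    history: the last 7 days of readings for this metric at this hour.
    """
    baseline_mean = mean(history)
    baseline_stddev = stdev(history)
    # Deviation in standard deviations; abs() so drops below
    # baseline count as anomalies too.
    deviation = abs(current_value - baseline_mean) / baseline_stddev
    if deviation > 4.0:
        return "CRITICAL"
    elif deviation > 3.0:
        return "HIGH"
    elif deviation > 2.0:
        return "WARNING"
    return "NORMAL"

# Seven same-hour disk readings hovering around 68%:
history = [68.2, 69.1, 67.8, 68.9, 68.0, 68.5, 68.3]
print(classify(68.9, history))  # small wobble -> NORMAL
print(classify(91.4, history))  # far outside the baseline -> CRITICAL
```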

Why This Works

Context-aware

Different baselines for prod vs. dev, database vs. web server

Time-aware

Different baselines for 2 PM vs. 3 AM

Adaptive

Baselines shift as infrastructure grows or changes

Statistical

Uses standard deviations, not arbitrary percentages

Self-tuning

No manual threshold updates needed

Why 75% Disk at 2 PM ≠ 75% Disk at 3 AM

Same metric value means different things at different times. SentienGuard learns time-of-day patterns automatically so that peak-hour traffic doesn't trigger false alarms and off-hour anomalies don't go undetected.

Static Threshold (Fails)
Static: False Positive at Peak
Disk Usage Over 24 Hours (prod-web-01)
100% ┤
     │            ╭──────╮               ← False Positive
 85% ┼─────────── █──────█ ──────────────  (threshold)
     │        ╭───╯      ╰───╮
 70% ┤    ╭───╯              ╰───╮
     │╭───╯                      ╰───╮
 50% ┼╯                              ╰───
     └┬────┬────┬────┬────┬────┬────┬───
      6AM  9AM  12PM 3PM  6PM  9PM  12AM

Analysis:
- 2 PM peak: 87% disk (normal traffic spike)
- Threshold 85% exceeded → ALERT
- Engineer investigates: False alarm
- Result: Wasted 30 minutes
Dynamic Baseline (Works)
Dynamic: Context-Aware Detection
Disk Usage with Time-of-Day Baseline
100% ┤
     │            ╭──────╮
 85% ┤        ╭───█──────█───╮  ← Baseline adjusts
     │    ╭───╯──╯      ╰──╰───╮ ← Normal range
 70% ┼────╯                  ╰────
     │╭──╯                      ╰──╮
 50% ┼╯                            ╰─────
     └┬────┬────┬────┬────┬────┬────┬───
      6AM  9AM  12PM 3PM  6PM  9PM  12AM

Analysis:
- 2 PM peak: 87% disk
- Expected: 85% ± 3% (2 PM business hours baseline)
- Deviation: 0.67σ (well within normal)
- Result: No alert, no wasted time

The Pattern Library

Pattern 1: Business Hours Peak (B2B SaaS)

B2B SaaS Pattern
Baseline by hour (CPU usage):
- 12 AM - 6 AM:  20% ± 5%   (overnight batch jobs only)
- 6 AM - 9 AM:   40% ± 8%   (morning login surge)
- 9 AM - 5 PM:   75% ± 10%  (business hours traffic)
- 5 PM - 9 PM:   55% ± 12%  (evening taper)
- 9 PM - 12 AM:  30% ± 6%   (minimal traffic)

Same server, different baselines every hour

Pattern 2: Batch Processing at Night

Batch Processing Pattern
Baseline by hour (CPU usage):
- 9 AM - 9 PM:   30% ± 5%   (low during day)
- 9 PM - 12 AM:  45% ± 8%   (starting batch jobs)
- 12 AM - 6 AM:  95% ± 3%   (full batch processing)
- 6 AM - 9 AM:   60% ± 10%  (finishing jobs)

95% CPU at 3 AM = normal (batch processing)
95% CPU at 3 PM = critical anomaly (something's wrong)

Pattern 3: E-Commerce Peak Hours

E-Commerce Pattern
Baseline by hour (request rate):
- 12 AM - 9 AM:  500/sec ± 100    (overnight, low traffic)
- 9 AM - 12 PM:  2,000/sec ± 300  (morning shopping)
- 12 PM - 3 PM:  5,000/sec ± 800  (lunch break peak)
- 3 PM - 6 PM:   3,000/sec ± 500  (afternoon taper)
- 6 PM - 9 PM:   4,000/sec ± 700  (evening shopping)
- 9 PM - 12 AM:  1,500/sec ± 250  (late night)

4,000 requests/sec at 8 PM = normal
4,000 requests/sec at 4 AM = 8× baseline = critical
Baseline Calculation
# Pseudocode: Time-of-day baseline calculation
def calculate_baseline(metric, host, current_hour):
    # Get last 7 days of data for this hour
    historical_data = get_metric_data(
        metric=metric,
        host=host,
        hour=current_hour,  # e.g., 14 (2 PM)
        last_n_days=7
    )

    # Calculate statistics
    baseline_mean = mean(historical_data)
    baseline_stddev = stddev(historical_data)

    # Define normal range
    normal_min = baseline_mean - (2 * baseline_stddev)
    normal_max = baseline_mean + (2 * baseline_stddev)

    return {
        "mean": baseline_mean,
        "stddev": baseline_stddev,
        "normal_range": [normal_min, normal_max]
    }

# Example output for prod-db-03, disk_usage, 2 PM:
# { "mean": 72.3, "stddev": 2.1, "normal_range": [68.1, 76.5] }
Detection Logic
# Pseudocode: Anomaly detection
def detect_anomaly(current_value, baseline):
    deviation = abs(current_value - baseline.mean) / baseline.stddev  # abs(): drops below baseline are anomalies too

    if deviation > 4.0:
        return "CRITICAL"  # >4σ
    elif deviation > 3.0:
        return "HIGH"      # 3-4σ
    elif deviation > 2.0:
        return "WARNING"   # 2-3σ
    else:
        return "NORMAL"    # <2σ

# Example: disk_usage = 91.4% at 2 PM
# baseline.mean = 72.3%, baseline.stddev = 2.1%
# (91.4 - 72.3) / 2.1 = 9.1σ → "CRITICAL"

Baselines Adapt as Infrastructure Evolves

Rolling 7-day windows prevent baseline staleness. Gradual infrastructure growth is absorbed automatically. Sudden spikes are detected correctly. One-time events like Black Friday are forgotten after 7 days.

Traffic Growth Over 6 Months

Month 1

5,000 req/sec ± 800

Reality: 5,200 avg

Normal (1σ)

Month 2

6,500 req/sec ± 1,000

Reality: 6,800 avg

Normal (1σ)

Month 4

9,200 req/sec ± 1,400

Reality: 9,500 avg

Normal (1σ)

Month 6

12,000 req/sec ± 1,800

Reality: 12,300 avg

Normal (1σ)

Static threshold would have broken by Month 2.

Problem: Fixed Baselines Go Stale
Week 1 baseline: CPU 40% (calculated from Days 1-7)
Week 52 baseline: Still 40% (never updated)
Reality: Infrastructure grew, optimized, changed
Actual CPU Week 52: 70% average
Result: Every metric triggers as anomaly (70% vs 40% baseline = 14σ!)
Solution: Rolling 7-Day Window
Day 1-7:   Baseline calculated from Days 1-7
Day 8:     Baseline calculated from Days 2-8 (drop Day 1, add Day 8)
Day 9:     Baseline calculated from Days 3-9 (drop Day 2, add Day 9)
...
Day 365:   Baseline calculated from Days 359-365

Baseline continuously adapts to infrastructure changes
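The drop-oldest/add-newest update above falls out naturally from a fixed-length deque; a minimal sketch (class and names are illustrative):

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """7-day rolling baseline: appending day 8 automatically drops day 1."""

    def __init__(self, window_days=7):
        self.daily_values = deque(maxlen=window_days)

    def add_day(self, value):
        self.daily_values.append(value)

    @property
    def baseline(self):
        return mean(self.daily_values), stdev(self.daily_values)

b = RollingBaseline()
for v in [40, 41, 40, 42, 41, 40, 41]:   # days 1-7
    b.add_day(v)
m1, _ = b.baseline                        # mean of days 1-7
b.add_day(43)                             # day 8: day 1 silently dropped
m2, _ = b.baseline                        # window now covers days 2-8, mean drifts up
```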

Gradual vs. Sudden Changes

Gradual Change (Absorbed)
Gradual Growth — No False Alarms
Day 1:   CPU 40%
Day 2:   CPU 41%
Day 3:   CPU 42%
...
Day 90:  CPU 65%

Change: +0.28% per day (gradual)
Baseline: Tracks smoothly (40% → 42% → 45% → ... → 65%)
Result: No false alarms, baseline adapts
Sudden Change (Detected)
Sudden Spike — Correctly Flagged
Day 1-89:  CPU 40% ± 3%
Day 90:    CPU 85% (sudden spike)

Change: +45% in one day (sudden)
Baseline: Still ~40% (7-day window includes Days 83-89, all ~40%)
Deviation: (85 - 40) / 3 = 15σ
Result: CRITICAL anomaly detected (correct!)
Black Friday: One-Time Event Handling
Nov 20-23: Normal traffic (5,000 req/sec baseline)
Nov 24 (Black Friday): 50,000 req/sec (10× spike)
Nov 25-30: Back to normal (5,000 req/sec)

Baseline during Black Friday week:
  Days: Nov 18, 19, 20, 21, 22, 23, 24
  Values: 5K, 5K, 5K, 5K, 5K, 5K, 50K
  Mean: 11.4K (inflated by one spike day)

Baseline 7 days later (Dec 1):
  Days: Nov 25, 26, 27, 28, 29, 30, Dec 1
  Values: 5K, 5K, 5K, 5K, 5K, 5K, 5K
  Mean: 5K (back to normal, spike forgotten)

Result: Black Friday spike doesn't pollute baseline forever
Weighted Rolling Average (Advanced)
# Weighted calculation (more recent = more important)
weights = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]  # Day 1 (oldest) → Day 7 (newest)
weighted_mean = sum(values[i] * weights[i] for i in range(7)) / sum(weights)

# Example:
values = [40, 41, 42, 43, 44, 45, 70]  # CPU % over 7 days
weights = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]

simple_mean = 46.4%    # Equal weight
weighted_mean = 49.8%  # Recent spike (Day 7 = 70%) weighted higher

Result: Baseline adapts faster to recent changes
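A quick runnable check of the weighted average, with the same values and weights:

```python
def weighted_mean(values, weights):
    """Weighted rolling average: newer days carry larger weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

values = [40, 41, 42, 43, 44, 45, 70]          # CPU % over 7 days, oldest first
weights = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]  # newest day weighted 4x the oldest

simple = sum(values) / len(values)             # ~46.4: equal weight
weighted = weighted_mean(values, weights)      # ~49.8: pulled toward day 7's spike
```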

How Dynamic Baselines Eliminate Alert Fatigue

Three techniques reduce false positives from 60-70% to under 10%: persistence filtering, known event exclusions, and multi-metric correlation.

Before: Static Thresholds
Alert Fatigue (60-70% False Positive)
Static thresholds — 100 alerts per week:
  - 60-70 false positives (engineer investigates, finds nothing wrong)
  - 20-30 real incidents (require action)
  - 5-10 missed incidents (threshold too loose)

Engineer response:
  Week 1:  Investigates all 100 alerts diligently
  Week 4:  Starts ignoring "probably false" alerts
  Week 8:  Ignores 80% of alerts (alert fatigue sets in)
  Week 12: Misses critical incident (lost in noise)

Result: Alerts become useless
After: Dynamic Baselines
Trustworthy Alerts (5-10% False Positive)
Dynamic baselines — 100 incidents detected per week:
  - 5-10 false positives (5-10% rate, down from 60-70%)
  - 85-90 real anomalies (true positive rate: 90%)
  - 2-3 missed incidents (sensitivity: 97%)

Engineer response:
  Every alert is probably real
  Engineers trust the system
  Alerts get investigated promptly
  Real incidents caught early

Result: Alerts remain useful

Persistence Filtering

Problem

Single data point spikes cause false alarms.

Solution

Anomaly must persist for 2+ minutes (4+ consecutive samples). Transient spikes—disk write bursts, CPU microbursts—ignored automatically. Real sustained anomalies always detected.

Known Event Exclusions

Problem

Scheduled maintenance triggers false alarms every time.

Solution

Configure known events (cron schedules) that suppress alerts during maintenance windows. Database backups every Sunday 2 AM, monthly batch jobs—all excluded automatically. Real anomalies outside maintenance windows still detected.

Multi-Metric Correlation

Problem

Single metric anomaly might be harmless.

Solution

Cross-check correlated metrics before escalating. Disk at 91% alone = lower confidence. Disk at 91% + rising error rate + high IOPS = high confidence. Correlated anomalies get autonomous execution; isolated anomalies require approval.

Persistence Filtering: How It Works
Disk usage samples (every 30s):
72%, 73%, 91%, 74%, 72%, 73%
              ↑
          One spike (temporary write, then deleted)

Static threshold: Triggers immediately on 91%

Dynamic baseline with persistence:
  Sample 1: 91% (anomaly detected)
  Sample 2: 74% (back to normal)
  Duration: 30 seconds (< 2 minutes required)
  Action: Ignore (transient spike)

Sustained anomaly:
  Sample 1-6: 91%, 92%, 90%, 91%, 93%, 91%
  Duration: 3 minutes (> 2 minutes required)
  Action: Trigger playbook (real incident)
Persistence Detection (Pseudocode)
def detect_persistent_anomaly(metric_history,
                              threshold_sigma=2.0,
                              duration_seconds=120):
    anomaly_start = None

    for sample in metric_history:
        if sample.deviation > threshold_sigma:
            if anomaly_start is None:
                anomaly_start = sample.timestamp
            else:
                duration = sample.timestamp - anomaly_start
                if duration >= duration_seconds:
                    return True  # Persistent anomaly
        else:
            anomaly_start = None  # Reset if back to normal

    return False  # No persistent anomaly
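The pseudocode above becomes runnable if samples are represented as (timestamp_seconds, deviation) pairs, oldest first; a minimal sketch under that assumption:

```python
def detect_persistent_anomaly(samples, threshold_sigma=2.0, duration_seconds=120):
    """Return True only if the deviation stays above threshold_sigma for
    at least duration_seconds without dipping back to normal.

    samples: list of (timestamp_seconds, deviation) pairs, oldest first.
    """
    anomaly_start = None
    for timestamp, deviation in samples:
        if deviation > threshold_sigma:
            if anomaly_start is None:
                anomaly_start = timestamp          # anomaly begins
            elif timestamp - anomaly_start >= duration_seconds:
                return True                        # persisted long enough
        else:
            anomaly_start = None                   # reset on return to normal
    return False

# 30-second samples: a single spike resets; a sustained run trips the filter.
transient = [(0, 1.0), (30, 9.0), (60, 1.1), (90, 0.9)]
sustained = [(0, 9.0), (30, 9.1), (60, 8.8), (90, 9.2), (120, 9.0)]
```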
Known Event Exclusions (Config)
# Known events configuration
known_events:
  - name: weekly_database_backup
    schedule: "0 2 * * 0"  # Cron: Every Sunday 2 AM
    duration: 120  # minutes
    suppress_alerts:
      - disk_usage
      - cpu_usage
      - network_throughput
    hosts:
      - "*.db.*"  # All database servers
Event Suppression Logic
def should_suppress_alert(metric, host, timestamp):
    for event in known_events:
        if event.matches(host) and event.is_active(timestamp):
            if metric in event.suppress_alerts:
                return True  # Suppress, known event
    return False  # Don't suppress
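A self-contained sketch of the suppression check, with the cron schedule simplified to a weekday-plus-hour window (the event structure below is illustrative, not SentienGuard's actual API):

```python
from datetime import datetime, timedelta
from fnmatch import fnmatch

# Simplified known event: Sunday 2 AM for 120 minutes, standing in for
# the cron expression "0 2 * * 0" in the config above.
EVENT = {
    "host_pattern": "*.db.*",
    "weekday": 6,            # Sunday (Monday == 0)
    "start_hour": 2,
    "duration_min": 120,
    "suppress": {"disk_usage", "cpu_usage", "network_throughput"},
}

def should_suppress(metric, host, ts):
    """True if (metric, host, ts) falls inside the known-event window."""
    if not fnmatch(host, EVENT["host_pattern"]):
        return False
    start = ts.replace(hour=EVENT["start_hour"], minute=0, second=0, microsecond=0)
    in_window = (ts.weekday() == EVENT["weekday"]
                 and start <= ts < start + timedelta(minutes=EVENT["duration_min"]))
    return in_window and metric in EVENT["suppress"]

# Sunday 2:30 AM on a database host: inside the backup window, suppress.
backup_time = datetime(2024, 1, 7, 2, 30)   # 2024-01-07 is a Sunday
```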
Multi-Metric Correlation Analysis
Primary anomaly: disk_usage = 91%

Check correlated metrics:
  - inode_usage: 45% (normal, not exhausted)
  - disk_write_iops: 1,200 (normal, not thrashing)
  - disk_read_latency: 3ms (normal, no performance degradation)
  - application_errors: 0 (normal, no failures)
  - tcp_connections: 150 (normal, services healthy)

Interpretation:
  Disk space is high, but performance is normal.
  Likely cause: Legitimate large file write.

Action: Lower confidence (0.75 → requires approval vs. autonomous)
Correlation Rules (Config)
# Multi-metric correlation
correlation_rules:
  disk_usage_critical:
    primary: disk_usage > baseline + 4σ
    correlations:
      - metric: inode_usage
        threshold: > 90%
        weight: 0.3
      - metric: disk_write_iops
        threshold: > baseline + 3σ
        weight: 0.2
      - metric: application_errors
        threshold: > 0
        weight: 0.5

    confidence_boost: 0.2  # If correlations match, increase confidence
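One plausible way those weights could combine into a confidence adjustment: boost the base confidence in proportion to how much correlated-metric weight actually fired. The scoring rule below is an assumption for illustration, not the documented algorithm:

```python
def correlation_confidence(base_confidence, checks, boost=0.2):
    """Boost confidence by how much correlated-metric weight fired.

    checks: list of (weight, fired) pairs mirroring the correlation_rules config.
    """
    total = sum(w for w, _ in checks)
    fired = sum(w for w, hit in checks if hit)
    return min(1.0, base_confidence + boost * (fired / total))

# Disk at 91% but inodes, IOPS, and app errors all normal: no boost.
isolated = correlation_confidence(0.75, [(0.3, False), (0.2, False), (0.5, False)])
# Same disk anomaly with application errors also firing: partial boost.
correlated = correlation_confidence(0.75, [(0.3, False), (0.2, False), (0.5, True)])
```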

From Warning to Critical Based on Deviation

Four severity tiers determined by statistical deviation. Higher deviation = rarer event = faster response. Critical anomalies (>4σ) trigger autonomous resolution.

Normal

0-2σ (~95% of samples)

CPU 68% (baseline 65% ± 5%) → 0.6σ deviation

Response

No action needed. Expected variance within normal distribution.

Warning

2-3σ (~2-5% of samples)

Memory 9.3 GB (baseline 8.2 GB ± 0.5 GB) → 2.2σ deviation

Response

Incident logged in dashboard. No playbook execution yet. If persists >10 minutes → escalate to HIGH.

High

3-4σ (~0.1-0.5% of samples)

Disk 74% (baseline 68% ± 2%) → 3.0σ deviation

Response

RAG selects playbook. Approval required via Slack notification. Human confirms before execution. Typical for first-time incidents.

Critical

>4σ (<0.01% of samples)

Disk 91% (baseline 68% ± 2%) → 11.5σ deviation

Response

RAG selects playbook. If confidence >0.90: execute autonomously. If 0.70-0.90: fast-track Slack approval. If <0.70: escalate to human.

Severity   Deviation   Frequency    Response                 Approval
Normal     0-2σ        ~95%         None                     N/A
Warning    2-3σ        ~2-5%        Log only                 N/A
High       3-4σ        ~0.1-0.5%    Playbook (approval)      Required
Critical   >4σ         <0.01%       Playbook (autonomous)    Optional
Threshold Configuration (Customizable)
# Customizable per organization
anomaly_thresholds:
  warning: 2.0   # standard deviations
  high: 3.0
  critical: 4.0

  persistence_required:
    warning: 300    # 5 minutes
    high: 120       # 2 minutes
    critical: 60    # 1 minute (faster response for critical)
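The configuration above might drive triage roughly as in this sketch; the dicts mirror the YAML, and the `triage` helper is illustrative:

```python
# Mirrors the YAML config above as plain dicts (illustrative, not SentienGuard's API).
ANOMALY_THRESHOLDS = {"warning": 2.0, "high": 3.0, "critical": 4.0}   # sigmas
PERSISTENCE_REQUIRED = {"warning": 300, "high": 120, "critical": 60}  # seconds

def triage(deviation):
    """Return (severity, seconds the anomaly must persist before acting)."""
    d = abs(deviation)
    for tier in ("critical", "high", "warning"):
        if d >= ANOMALY_THRESHOLDS[tier]:
            return tier.upper(), PERSISTENCE_REQUIRED[tier]
    return "NORMAL", None

# triage(9.1) -> ("CRITICAL", 60): act after one minute of persistence
# triage(2.2) -> ("WARNING", 300): log, act only if it lasts five minutes
```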

When Baselines Need Manual Intervention

Automatic rolling windows handle most changes. For major infrastructure shifts, two additional strategies provide control: manual reset and weighted baselines.

1

Automatic Rolling Window (Default)

7-day rolling average, updated daily. No manual intervention. Adapts to gradual changes automatically.

When It Works

Steady traffic growth, infrastructure optimization, seasonal patterns, normal operations.

Limitations

Major architecture change (microservices migration), platform migration (AWS → GCP), traffic pattern shift (B2B → B2C pivot).

2

Manual Baseline Reset

Dashboard → Anomaly Detection → Reset Baseline. Reset for specific host, host pattern, metric, or all. Starts fresh 7-day learning.

When It Works

Major infrastructure changes, platform migrations, architecture overhauls, traffic pattern shifts.

Limitations

7-day blind period without anomaly detection. Mitigate with temporary static thresholds.

3

Weighted Baseline (Prefer Recent Data)

Weight recent days higher than older days (2.0× for today, 0.5× for 7 days ago). Baseline adapts faster. Exponential decay option available.

When It Works

Rapid infrastructure changes, frequent deployments, A/B testing affecting metrics.

Limitations

Reduces statistical stability. Recent outliers have more influence.

Migration Failure → Manual Reset
Before migration: Monolith, CPU 40% baseline
After migration: Microservices, CPU 70% baseline

Problem: Rolling window takes 7 days to adapt
Result: 7 days of false alarms (70% vs 40% = constant anomalies)

Solution: Manual baseline reset
Dashboard → Anomaly Detection → Reset Baseline
→ Discards old baseline, starts fresh 7-day learning
Temporary Thresholds During Reset
# Temporary static thresholds during reset
temporary_thresholds:
  enabled: true
  duration: 7 days
  thresholds:
    cpu_usage: > 90%
    memory_usage: > 95%
    disk_usage: > 85%
Weighted vs. Standard Average
CPU usage last 7 days: [40%, 42%, 44%, 46%, 48%, 50%, 65%]

Standard average: 47.9%
Weighted average: 50.9% (Day 7's 65% weighted higher)

Current: 68%
Deviation (standard): (68 - 47.9) / stddev = 4.2σ → CRITICAL
Deviation (weighted): (68 - 50.9) / stddev = 3.6σ → HIGH

Result: Weighted baseline reduces false alarm severity
Weighted Baseline Configuration
# Enable weighted baseline
baseline_calculation:
  method: weighted_rolling_average
  weights: [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]  # Day 1-7

# Or use exponential decay
baseline_calculation:
  method: exponential_decay
  decay_factor: 0.85  # Each day further back weighted 15% less
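An exponential-decay baseline can be sketched as follows, with each day further back multiplied by another factor of `decay` (illustrative implementation):

```python
def decayed_mean(values, decay=0.85):
    """Exponentially decayed mean: values oldest-first, newest weighted 1.0,
    each day further back multiplied by another factor of `decay`."""
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # oldest gets decay^(n-1)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

values = [40, 41, 42, 43, 44, 45, 70]   # same series as the weighted example
# Plain mean is ~46.4; the decayed mean lands higher (~48.9) because
# day 7's spike to 70% carries the largest weight.
```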

From Anomaly Detection to Autonomous Resolution

Five-stage pipeline: detect anomaly, enrich context, select playbook, execute, verify. Total latency: detection (100ms) + enrichment (50ms) + RAG selection (165ms) + execution (87s) + verification (10s) ≈ 97 seconds from anomaly to resolution.

1

Stage 1: Anomaly Detected

<100ms

Metric arrives: prod-db-03, disk_usage = 91.4%. Baseline: 68.2% ± 2.1%. Deviation: (91.4 - 68.2) / 2.1 = 11.0σ. Severity: CRITICAL (>4σ).

2

Stage 2: Context Enrichment

<50ms

Anomaly enriched with host metadata (prod-db-03.us-east-1, production, postgresql), time context (2:47 PM, business hours), and correlated metrics (inode 45% normal, write IOPS normal, errors = 0). Interpretation: disk full, services still healthy.

3

Stage 3: RAG Playbook Selection

~165ms

Enriched context → RAG → Top 3 playbooks: disk_cleanup_prod_db (0.94), disk_cleanup_general (0.87), log_rotation_postgres (0.82). Selected: disk_cleanup_prod_db (highest confidence).

4

Stage 4: Execution Decision

87 seconds

Confidence 0.94 exceeds 0.90 threshold. Severity CRITICAL. Decision: execute autonomously. Playbook dispatched to agent on prod-db-03. Steps: clear temp files, rotate logs, verify disk returned to normal range.

5

Stage 5: Verification & Closure

~10s

Execution complete. Post-execution: disk_usage = 72.1% (down from 91.4%). Deviation: 1.9σ (back to normal). Incident closed: autonomous resolution successful.

Complete Flow Visualization
┌─────────────────────────────────────────────────────────┐
│ Stage 1: Anomaly Detection                              │
│ Disk 91.4% (11.0σ above baseline) → CRITICAL          │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Context Enrichment                             │
│ + Host type, environment, time, correlated metrics     │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 3: RAG Playbook Selection                         │
│ disk_cleanup_prod_db selected (0.94 confidence)        │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 4: Autonomous Execution                           │
│ 0.94 > 0.90 → Execute without approval                 │
│ Duration: 87 seconds                                    │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 5: Verification                                   │
│ Disk 72.1% (1.9σ, back to normal) → Incident closed   │
└─────────────────────────────────────────────────────────┘

End-to-End Latency Breakdown

100ms

Detection

50ms

Enrichment

165ms

RAG Selection

87s

Execution

10s

Verification

Total: ~97 seconds from anomaly to resolution

Common Questions

How long does baseline learning take?
Seven days of data are required for statistical confidence. During the first 7 days, SentienGuard uses conservative static thresholds (CPU >95%, Disk >90%, Memory >95%) to avoid false alarms. After 7 days, dynamic baselines activate automatically.

What happens when traffic patterns change?
The rolling 7-day window adapts automatically to gradual changes. For sudden shifts (a B2B → B2C pivot, say), manually reset the baseline via the dashboard; relearning the new normal takes 7 days. During the reset, use temporary static thresholds to avoid a blind period.

Can I tune the anomaly threshold?
Yes. Dashboard → Settings → Anomaly Detection → Threshold. Options: 1.5σ (sensitive, more alerts), 2.0σ (default, balanced), 2.5σ (conservative, fewer alerts), 3.0σ (only extreme anomalies). Applies organization-wide or per host pattern.

What if the baseline learned from abnormal data?
Reset the baseline manually. Example: a baseline learned during a DDoS attack (abnormally high traffic) is inflated, so real incidents get missed. Solution: Dashboard → Reset Baseline → fresh 7-day learning from normal operations.

Is this machine learning?
SentienGuard uses statistical learning (mean, standard deviation, rolling windows), not deep-learning neural networks. Benefits: explainable (you know why an alert fired), fast (<100ms), low resource usage, no training data required. Trade-off: simpler models can't detect the complex multivariate patterns a neural network might catch.

Can I exclude one-time events from the baseline?
Yes. Dashboard → Anomaly Detection → Exclusions → Add Time Range. Specify a date range to exclude from baseline calculation; those days' data are discarded and don't influence the baseline. Useful for one-time events like Black Friday or product launches that shouldn't define "normal."

See Dynamic Baselines in Action

Deploy agents, watch 7-day baseline learning, trigger test incident, see severity classification, observe autonomous resolution.

Day 1: Deploy agents, metrics flowing
Day 7: Baseline established
Day 8: Trigger test (fill disk to 90%)
Day 8: Anomaly detected (severity: CRITICAL)
Day 8: Playbook executes autonomously
Day 8: Disk back to normal (incident closed)

Free tier: 3 nodes, 7-day baseline learning, full anomaly detection, no credit card.