Product Overview
From Detection to Resolution Without Waking Humans
Lightweight agents detect anomalies, RAG engine selects playbooks, autonomous execution fixes incidents, immutable logs prove it happened. Four-stage pipeline, <3 minutes total latency, zero manual intervention.
How It Works: End-to-End Architecture
Four components: agents in your infrastructure, control plane for intelligence, execution orchestrator for playbooks, immutable storage for audit trail.
Servers 1 through N: one agent per server (50 MB binary · <100 MB RAM)
Metrics every 30s (batched)
CPU, memory, disk, network, process count, service health
Outbound HTTPS (443) · TLS 1.3 · Cert Pinning
- Time-series database
- 30-second intervals
- Metric normalization
- Dynamic baselines (7-day rolling)
- Statistical analysis (σ deviation)
- Time-of-day pattern matching
- Incident → 1536-dim vector
- Semantic playbook search
- Confidence scoring (0.0-1.0)
- Context: host, env, time-of-day
- Ed25519 playbook signing
- Command dispatch to agent
- Health verification per step
- Automatic rollback on failure
- AWS S3 + Object Lock (WORM)
- SHA-256 hash-chained entries
- Immutable — cannot be modified
- 2-year retention (7-year configurable)
Playbook execution results (commands, outputs, timestamps)
- Bash commands & scripts
- File operations (cleanup, rotation)
- Service management (restart, reload)
- Database connection management
- Pod restarts & eviction
- Horizontal scaling (replica count)
- Rolling rollbacks
- Node drain & cordon
End-to-End Flow: Detection to Resolution
Metrics Collection (30-second intervals)
- Agent collects: CPU usage per core, memory (used/available/swap), disk usage per filesystem, network (bytes in/out, packet loss), process count, service health checks
- Agent batches metrics, sends via HTTPS to control plane
- Latency: <200ms from collection to ingestion
Anomaly Detection (real-time statistical analysis)
- Control plane maintains 7-day rolling baseline per metric per host
- Calculates: mean, standard deviation, time-of-day patterns for every metric
- Detects: Deviations >2σ from expected (configurable threshold per metric)
- Example: Disk usage 91% when baseline is 68% ± 5% = 4.6σ deviation → anomaly
- Latency: <100ms from metric arrival to anomaly detection
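The deviation check described above can be sketched in a few lines of Python. This is a minimal illustration, not the control plane's implementation: the function names are hypothetical, and it assumes the baseline is summarized as a plain mean and standard deviation, without the time-of-day adjustment.

```python
from statistics import mean, pstdev


def baseline_from_samples(samples):
    """Summarize a window of historical readings as (mean, standard deviation)."""
    return mean(samples), pstdev(samples)


def sigma_deviation(value, baseline_mean, baseline_std):
    """Return how many standard deviations `value` lies from the baseline mean."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return abs(value - baseline_mean) / baseline_std


def is_anomaly(value, baseline_mean, baseline_std, threshold_sigma=2.0):
    """Flag readings that deviate by more than `threshold_sigma` (2 sigma by default)."""
    return sigma_deviation(value, baseline_mean, baseline_std) > threshold_sigma


# Disk usage example from the text: 91% against a baseline of 68% +/- 5%.
deviation = sigma_deviation(91.0, 68.0, 5.0)
print(f"{deviation:.1f} sigma, anomaly={is_anomaly(91.0, 68.0, 5.0)}")
```

The worked example reproduces the 4.6σ figure: (91 − 68) / 5 = 4.6, which is above the default 2σ threshold.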
Playbook Selection (RAG semantic search)
- Incident converted to vector embedding (1536 dimensions) capturing metric type, host context, environment, time-of-day
- Semantic search across playbook library (50+ pre-built playbooks)
- Context matching: host type (VM, container, bare metal), environment (prod/staging), time-of-day, historical success rate for similar incidents
- Confidence scoring: >0.90 autonomous execution, 0.70-0.90 requires human approval via Slack, <0.70 escalates to on-call with full context
- Latency: <165ms from anomaly to playbook selection
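The selection and routing steps above might look roughly like the following sketch. It rests on several assumptions that the text does not confirm: cosine similarity is used as the search metric, the similarity score is used directly as the confidence score, and toy 3-dimensional vectors stand in for the 1536-dimension embeddings. All function and class names are hypothetical. Only the three confidence bands come from the text.

```python
import math
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "autonomous execution"
    APPROVAL = "human approval via Slack"
    ESCALATE = "escalate to on-call"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_playbook(incident_vec, library):
    """Return (name, similarity) of the playbook closest to the incident vector."""
    name = max(library, key=lambda n: cosine_similarity(incident_vec, library[n]))
    return name, cosine_similarity(incident_vec, library[name])


def route_for_confidence(score):
    """Map a confidence score to the three bands given in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score > 0.90:
        return Route.AUTONOMOUS
    if score >= 0.70:
        return Route.APPROVAL
    return Route.ESCALATE


# Toy 3-dimensional embeddings (assumption: real vectors have 1536 dimensions).
library = {
    "disk_cleanup": [0.9, 0.1, 0.0],
    "memory_restart": [0.0, 1.0, 0.0],
}
name, score = best_playbook([1.0, 0.0, 0.0], library)
print(name, round(score, 3), route_for_confidence(score).value)
```

A score of exactly 0.90 falls into the approval band here, because the text reserves autonomous execution for scores strictly above 0.90.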
Execution (autonomous or approval-gated)
- Control plane signs playbook with Ed25519 cryptographic signature
- Agent verifies signature and timestamp freshness (<5 minutes) before execution
- Agent executes steps via SSH, kubectl, or cloud provider APIs on the target host
- Health verification after each step confirms the action had the desired effect
- Automatic rollback reverses all changes if any verification step fails
- Latency: 10-90 seconds depending on playbook complexity
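The execute, verify, and rollback loop described above can be illustrated with a small sketch. The `Step` structure and `execute_playbook` function are hypothetical names, and the sketch assumes each step carries its own verification and rollback callables. A dictionary stands in for the target host.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]       # performs the action
    verify: Callable[[], bool]    # health check after the action
    rollback: Callable[[], None]  # undoes the action


def execute_playbook(steps: List[Step]) -> bool:
    """Run steps in order; on a failed check, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        step.run()
        completed.append(step)
        if not step.verify():
            for done in reversed(completed):
                done.rollback()
            return False
    return True


# Toy state standing in for a host (assumption for illustration only).
state = {"service": "running", "replicas": 2}

steps = [
    Step("stop service",
         run=lambda: state.update(service="stopped"),
         verify=lambda: state["service"] == "stopped",
         rollback=lambda: state.update(service="running")),
    Step("scale up",
         run=lambda: state.update(replicas=3),
         verify=lambda: False,  # simulate a failed health check
         rollback=lambda: state.update(replicas=2)),
]

ok = execute_playbook(steps)
print(ok, state)  # False {'service': 'running', 'replicas': 2}
```

Rolling back in reverse order means later steps are undone before the earlier steps they may depend on.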
Audit Logging (immutable storage)
- Every action logged: command text, full stdout/stderr output, nanosecond timestamp, exit code, RBAC authorizer identity
- Stored in AWS S3 with Object Lock (Write Once, Read Many) — cannot be modified or deleted
- Hash-chained entries: each record contains SHA-256 hash of the previous record, creating tamper-evident chain
- Retention: 2 years default (hot storage), configurable to 7 years (cold storage) for regulated industries
- Export formats: JSON for API and SIEM integration, CSV for spreadsheets, formatted PDF reports for auditor handoff
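The hash chaining described above can be sketched as follows. This is an illustration of the general technique, not the product's record format: the field names `prev_hash` and `record`, the all-zero genesis value, and the canonical JSON encoding are all assumptions.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first entry's predecessor


def entry_hash(entry):
    """SHA-256 over a canonical JSON encoding of the whole entry."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(chain, record):
    """Append a record that embeds the hash of the previous entry."""
    prev_hash = entry_hash(chain[-1]) if chain else GENESIS_HASH
    entry = {"prev_hash": prev_hash, "record": record}
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Recompute every link; any edit to an earlier entry breaks a later link."""
    expected = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != expected:
            return False
        expected = entry_hash(entry)
    return True


chain = []
append_entry(chain, {"command": "df -h", "exit_code": 0})
append_entry(chain, {"command": "systemctl restart nginx", "exit_code": 0})
append_entry(chain, {"command": "df -h", "exit_code": 0})
print(verify_chain(chain))  # True

chain[0]["record"]["command"] = "rm -rf /tmp/evidence"  # tamper with history
print(verify_chain(chain))  # False
```

A chain on its own cannot detect a change to the most recent entry, since nothing hashes it yet. Detecting that requires the head hash to be recorded somewhere independent, or write-once storage of the kind described above.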
Total Pipeline Latency: Detection → Resolution
Two Ways to Deploy: Agent-Based or Direct API
Agent-Based
Recommended (95% of deployments)
Lightweight binary (50 MB) installed on each server. Collects metrics, executes playbooks locally, reports results to control plane. Full autonomous resolution capability with zero inbound attack surface.
Supported Platforms
- Linux: Ubuntu 20.04+, CentOS 7+, Debian 10+, RHEL 8+
- Architectures: x86_64, ARM64
- Container: Kubernetes (Helm chart), Docker (container runtime)
- Cloud: AWS EC2/EKS, GCP Compute/GKE, Azure VMs/AKS
- On-premises: Bare metal, VMware, Proxmox
Installation
Linux
curl -sSL https://get.sentienguard.com/install | bash
Kubernetes (Helm)
helm repo add sentienguard https://charts.sentienguard.com
helm install sentienguard sentienguard/agent \
  --set apiKey=$SENTIENGUARD_API_KEY
Resource Usage
What Agent Collects
- Infrastructure metrics: CPU, memory, disk, network (via eBPF + system APIs)
- Process metrics: count, resource usage per process, open file descriptors
- Kubernetes metrics: pod status, node health, events (via the Kubernetes API)
- Service health: HTTP endpoints, TCP ports, systemd unit status
What Agent Executes
- SSH commands: bash scripts, file operations, service management
- Kubernetes operations: kubectl (pod restart, scale, rollback, drain)
- Cloud provider APIs: AWS CLI, gcloud, az CLI for cloud-native operations
- Database queries: PostgreSQL, MySQL connection pool management
Security Model
- Outbound-only: Agent initiates HTTPS (443) to control plane, never listens
- No inbound ports: Zero attack surface from network scanning or exploitation
- TLS 1.3: Certificate pinning prevents man-in-the-middle attacks
- Cryptographic verification: Playbooks signed by control plane, agent verifies before execution
- Non-root execution: Runs as dedicated service account with minimal privileges (configurable)
Advantages
- Full playbook execution capability (not just metrics collection)
- Works in air-gapped environments with cached playbooks (Enterprise tier)
- Local execution means faster remediation (<60s typical round-trip)
- Offline resilience: agent caches playbooks, executes during network outage
Direct API
Specialized Use Cases
Send metrics directly to the SentienGuard API endpoint (/v1/incidents). No agent installation required. Metrics-only mode with AI-powered playbook recommendations. Ideal for serverless architectures or gradual evaluation before full agent deployment.
Supported Platforms
- AWS Lambda (serverless functions)
- Cloud Run, Cloud Functions (GCP serverless)
- Azure Functions (serverless)
- Edge computing (Cloudflare Workers, Lambda@Edge)
- Custom applications (any HTTP client)
API Call Example
curl -X POST https://api.sentienguard.com/v1/incidents \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"host": "lambda-payment-processor",
"metric": "invocation_errors",
"value": 15.2,
"threshold": 5.0,
"environment": "production"
}'
What You Send
- Metric name (string): "disk_usage", "error_rate", "latency_p99"
- Current value (number): 91.4, 12.5, 850
- Threshold (number): 85.0, 5.0, 500
- Host identifier (string): unique identifier for the metric source
- Environment (string): "production", "staging", "dev"
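For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a sketch under assumptions: the helper name `build_incident_request` is hypothetical, the URL, headers, and field names are copied from the curl example, and the response format is not documented here, so the request is built but not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.sentienguard.com/v1/incidents"


def build_incident_request(api_key, host, metric, value, threshold, environment):
    """Build (but do not send) a POST request carrying one incident."""
    payload = {
        "host": host,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "environment": environment,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_incident_request(
    api_key="test-key",  # placeholder; use a real key in practice
    host="lambda-payment-processor",
    metric="invocation_errors",
    value=15.2,
    threshold=5.0,
    environment="production",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it. That call is left out so the sketch runs without network access or a valid key.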
What SentienGuard Does
- Anomaly detection: compare value against dynamic baseline for that metric
- Playbook selection: RAG matches incident to remediation strategy
- Notification: Slack/email alert with recommended playbook and confidence score
- Audit logging: record incident, recommendation, and outcome
Limitations
- No autonomous execution (cannot run playbooks without an installed agent)
- Metrics-only mode (detection and recommendation without remediation)
- Must trigger remediation manually or via webhook callback
When to Use
- Serverless architectures (no persistent servers for agent installation)
- Custom monitoring systems (already collecting metrics, want AI playbook recommendations)
- Alerting enrichment (add RAG-powered recommendations to existing alerting pipeline)
- Gradual evaluation (start with metrics-only, add agents to production later)
Advantages
- Zero infrastructure overhead (no agent binary to manage or update)
- Works with serverless and ephemeral compute environments
- Simple HTTP integration from any language or platform
- Gradual adoption path: start with metrics, add agents when ready
| Feature | Agent-Based | Direct API |
|---|---|---|
| Installation | Binary or Helm chart | HTTP POST to endpoint |
| Metrics Collection | Automatic (30s intervals) | Manual (you send) |
| Playbook Execution | Autonomous | Manual only |
| Latency to Resolution | <90s autonomous | Hours (human-dependent) |
| Supported Platforms | Linux, Kubernetes | Any HTTP client |
| Resource Usage | <100 MB RAM, 0.5% CPU | Zero (no agent) |
| Air-Gapped Support | Yes (Enterprise) | No (requires internet) |
| Audit Logging | Complete trail | Detection only |
| Cost | $4/node/month | $4/incident/month |
| Best For | Production infrastructure | Serverless, evaluation |
95% of deployments use agent-based model for full autonomous resolution. Direct API is for serverless architectures or gradual evaluation. Start with agents for production workloads.
Zero Inbound Attack Surface
Outbound-Only Communication
Principle
Agents never listen on network ports. All communication initiated outbound from agent to control plane. No inbound connections accepted.
Implementation
- Agent connects to: control.sentienguard.com:443
- Protocol: HTTPS (TLS 1.3)
- Direction: Outbound only (agent → control plane)
- Firewall rules: Allow outbound 443, deny all inbound
- NAT-friendly: Works behind corporate firewalls and HTTP proxies
What This Prevents
- Inbound exploitation (no listening ports to attack)
- Lateral movement (compromised agent cannot accept commands from attacker)
- Port scanning (no services exposed to network)
Comparison
Certificate Pinning
Principle
Agent trusts only SentienGuard's specific TLS certificate. Man-in-the-middle attacks impossible even if attacker has a valid CA-signed certificate.
Implementation
- SentienGuard CA certificate hash embedded in agent binary at compile time
- Agent verifies server certificate matches pinned hash on every connection
- Connection refused if certificate does not match (no fallback to CA trust)
- Certificate rotation: new agent version required (controlled deployment)
What This Prevents
- Man-in-the-middle attacks (attacker cannot impersonate control plane)
- Rogue control plane (agent refuses connection to unauthorized servers)
- Certificate authority compromise (pinning bypasses entire CA trust chain)
Certificate Pinning Verification (Pseudocode)
expectedHash := "sha256:a3f8b9c2d1e4..."
actualHash := sha256(serverCertificate)
if actualHash != expectedHash {
return error("Certificate pinning failed")
}
Cryptographic Playbook Signing
Principle
Every playbook signed by control plane with private key. Agent verifies signature before execution. Prevents unauthorized command injection.
Implementation
- Control plane signs playbook with Ed25519 private key
- Signature covers: playbook YAML, timestamp, incident ID, target host
- Agent verifies signature with public key (embedded in agent binary)
- Execution proceeds only if signature is valid AND timestamp is fresh (<5 minutes)
What This Prevents
- Unauthorized playbook injection (attacker cannot forge Ed25519 signature)
- Replay attacks (timestamp freshness check rejects stale playbooks)
- Playbook tampering (any modification invalidates the signature)
Signed Playbook Payload
{
"playbook": "disk_cleanup_prod_db",
"version": "1.4.2",
"incident_id": "inc_2026_02_10_1435",
"target_host": "prod-db-03.us-east-1",
"timestamp": "2026-02-10T14:35:43.891Z",
"signature": "ed25519:a8f3b2c1d9e4..."
}
Agent Verification Process
1. Extract signature from payload
2. Verify signature using control plane public key
3. Check timestamp (must be within 5 minutes of current time)
4. Verify target host matches agent's hostname
5. If all checks pass: execute playbook
6. If any check fails: reject, log failed authorization attempt
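The order of checks in the verification process can be sketched as below. The function name `authorize` and its return shape are hypothetical. Signature verification is passed in as a callable because Python's standard library has no Ed25519 support; a real agent would verify with a cryptography library over the exact signed bytes, and the stub used in the demo performs no cryptographic check at all.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Tuple

MAX_AGE = timedelta(minutes=5)  # freshness window stated in the text


def authorize(
    payload: dict,
    local_hostname: str,
    verify_signature: Callable[[dict], bool],
    now: Optional[datetime] = None,
) -> Tuple[bool, str]:
    """Apply the signature, freshness, and target-host checks in order."""
    now = now or datetime.now(timezone.utc)

    if not verify_signature(payload):
        return False, "invalid signature"

    # Accept a trailing "Z" on Python versions whose parser does not.
    issued = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    if abs(now - issued) > MAX_AGE:
        return False, "stale timestamp"

    if payload["target_host"] != local_hostname:
        return False, "host mismatch"

    return True, "ok"


payload = {
    "playbook": "disk_cleanup_prod_db",
    "incident_id": "inc_2026_02_10_1435",
    "target_host": "prod-db-03.us-east-1",
    "timestamp": "2026-02-10T14:35:43.891Z",
    "signature": "ed25519:a8f3b2c1d9e4...",
}


def always_valid(_payload):
    """Stub only: stands in for real Ed25519 verification."""
    return True


check_time = datetime(2026, 2, 10, 14, 37, tzinfo=timezone.utc)
print(authorize(payload, "prod-db-03.us-east-1", always_valid, now=check_time))
# (True, 'ok')
```

Returning a reason alongside the decision gives the agent something concrete to write to the audit log when it rejects a playbook.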
Attack Surface Analysis
Traditional Monitoring Agent
- Listening ports (StatsD, HTTP metrics endpoint)
- Accepts inbound connections from any source
- Trusts CA-signed certificates (MITM vulnerable)
- Executes commands from any authenticated source
SentienGuard Agent
- Zero listening ports (outbound-only)
- Refuses all inbound connections
- Certificate pinning (MITM impossible)
- Cryptographically signed playbooks only
Result: 90% reduction in attack surface compared to traditional monitoring agents.
Performance & Capacity
Agent Performance
Latency
Throughput
Resource Limits
Control Plane Performance
Latency
Throughput
Availability
Storage Performance
Audit Logs
Capacity (1,000-node deployment)
Playbook Library
Capacity
Performance
Six Core Components
Click through to deep-dive pages for technical details on each component.
Semantic Playbook Matching
1536-dimension vector embeddings match incidents to remediation strategies using retrieval-augmented generation. Context matching evaluates host type, environment, time-of-day, and historical success rates. Confidence scoring determines autonomous execution (>0.90), human approval (0.70-0.90), or escalation (<0.70). The system gets smarter over time as successful resolutions reinforce playbook confidence scores and failed attempts get flagged for review and refinement.
Lightweight, Outbound-Only
50 MB binary with <100 MB RAM resident footprint. Zero inbound ports opened on your infrastructure. Certificate pinning prevents man-in-the-middle attacks. Cryptographic playbook signing ensures only authorized remediation executes. Non-root service account with minimal privileges. Deploys in 2 minutes via one-liner or Helm chart. Works behind corporate firewalls and NAT gateways without configuration.
Dynamic Baselines, Not Static Thresholds
7-day rolling average with time-of-day patterns captures Monday morning traffic spikes and Friday evening lulls. Statistical deviation detection triggers on >2σ deviations from expected behavior. Adapts to infrastructure growth, seasonal patterns, and deployment cadences automatically. New deployments recalibrate baselines within 48 hours. No manual threshold tuning needed. High-signal, low-noise anomaly detection that catches real problems and ignores expected fluctuations.
Execute, Verify, Rollback
Idempotent playbooks execute via SSH, kubectl, or cloud provider APIs. Every step includes health verification to confirm the action had the desired effect before proceeding. If any verification step fails, automatic rollback reverses all changes made during the current execution. Complete stdout/stderr captured for every command. Cryptographically signed audit trail records exactly what was run, when, by which agent, on which host. Typical execution under 60 seconds for routine infrastructure fixes.
Immutable Compliance Evidence
S3 Object Lock (Write Once, Read Many) prevents modification or deletion of audit records. SHA-256 hash-chained entries create tamper-evident chain that auditors can independently verify. 2-year default retention, configurable to 7 years for regulated industries. Each entry captures: Who, RBAC Authorizer, What, When, Where, and Result. Satisfies HIPAA §164.312(b), SOC 2 CC6.1/CC7.2, PCI-DSS Requirement 10, ISO 27001 A.12.4. Export as JSON, CSV, or formatted PDF.
Unified Infrastructure Dashboard
Real-time health monitor showing fleet status across all environments. Incident timeline with full execution history and audit trail. Playbook library with search, import, and custom YAML editor. User management with RBAC roles: Observer (view only), Remediation Authority (approve and execute), Admin (full control). Multi-tenant architecture for MSPs managing multiple client environments with strict isolation. API access for programmatic integration with existing tooling.
Works With Your Existing Stack
Monitoring Sources
Datadog
Import monitors as playbook triggers
Prometheus
AlertManager webhook integration
CloudWatch
SNS to SentienGuard API endpoint
Grafana
Webhook notification channel
Custom metrics
HTTP POST to /v1/incidents API
Execution Targets
SSH
Linux servers, bash commands, file operations
Kubernetes
kubectl via API (pod restart, scale, rollback, drain)
AWS
CLI, boto3, CloudFormation stack operations
GCP
gcloud, Cloud SDK for Compute, GKE, Cloud SQL
Azure
az CLI, ARM templates for VMs, AKS, SQL
Notification Channels
Slack
Approval gates, incident summaries, resolution reports
Email
SMTP, SendGrid, AWS SES integration
PagerDuty
Escalation on autonomous failure or low confidence
Webhooks
Custom HTTP callbacks for any integration
SMS
Twilio for critical alerts and escalation
Storage & Logging
AWS S3
Primary audit log storage with Object Lock (WORM)
Elasticsearch
Optional log shipping for search and analysis
Splunk
SIEM integration for security event correlation
Datadog Logs
Forward audit logs if keeping Datadog for dashboards
Sumo Logic
Log aggregation and compliance reporting
How Teams Deploy SentienGuard
Replace Datadog Entirely
Setup
- Datadog current cost: $18K/month (500 nodes, $15/host + metrics + APM)
- Deploy SentienGuard agents on all 500 nodes ($4/node = $2K/month)
- Import Datadog monitors as SentienGuard playbook triggers
- Self-host Grafana for dashboards ($0) or use Grafana Cloud ($1.5K/month)
Timeline
Run both in parallel (validation). Prove 87% autonomous resolution on live incidents. Team reviews every auto-resolved incident to build confidence.
Shift alerting to SentienGuard as primary responder. Datadog becomes read-only dashboards. Cancel Datadog alerting and APM tiers.
Cancel Datadog entirely. Deploy Grafana for any dashboard needs. Full autonomous resolution operational.
Optimized state. Engineering team fully reclaimed for product work. On-call pages reduced 87%.
Result
Cost: $18K/month → $2K/month (89% reduction)
MTTR: 4 hours → 90 seconds (99% improvement)
Savings: $192K/year
Hybrid (Keep Datadog Dashboards)
Setup
- Datadog current cost: $18K/month (500 nodes, full suite)
- Deploy SentienGuard for autonomous remediation ($2K/month)
- Downgrade Datadog to infrastructure metrics only (no alerting, no APM, no log management)
- SentienGuard handles all incident detection and resolution
Timeline
Run both systems in parallel. SentienGuard shadow-resolves incidents while Datadog remains primary. Compare resolution times and accuracy.
Cancel Datadog alerting, APM, and log management tiers. Retain infrastructure metrics for dashboards. Route all incident response through SentienGuard.
Steady state. Datadog provides read-only dashboards at reduced tier ($4K/month). SentienGuard handles all autonomous resolution ($2K/month).
Result
Cost: $18K/month → $6K/month (67% reduction)
MTTR: 4 hours → 90 seconds (autonomous)
Savings: $144K/year
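The cost figures in the two Datadog scenarios above follow from simple arithmetic, checked in the sketch below. The helper names are hypothetical; the dollar amounts are the ones quoted in the Setup and Result sections.

```python
def monthly_reduction_pct(before, after):
    """Percentage drop in monthly spend, rounded to the nearest whole percent."""
    return round((before - after) / before * 100)


def annual_savings(before, after):
    """Twelve months of the monthly difference."""
    return (before - after) * 12


# Monthly costs in USD, as quoted in the text.
agent_cost = 500 * 4  # 500 nodes at $4/node/month
scenarios = {
    "Replace Datadog entirely": (18_000, agent_cost),
    "Hybrid (keep dashboards)": (18_000, agent_cost + 4_000),
}

for name, (before, after) in scenarios.items():
    print(f"{name}: {monthly_reduction_pct(before, after)}% reduction, "
          f"${annual_savings(before, after):,}/year saved")
```

This gives 89% and $192,000 per year for full replacement, and 67% and $144,000 per year for the hybrid setup.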
Greenfield (No Existing Monitoring)
Setup
- No Datadog, New Relic, or Prometheus subscription to migrate from
- Deploy SentienGuard agents on all production nodes ($4/node/month)
- Use included 50+ playbook library for common infrastructure incidents
- Add Grafana for dashboards (optional, $0 self-hosted or $1.5K/month cloud)
Timeline
Deploy agents across fleet. Import standard playbook library. Agents begin collecting metrics and building baselines immediately.
Tune baselines as 7-day rolling average establishes patterns. Add custom playbooks for application-specific scenarios. Start with approval mode, transition to autonomous.
Full autonomous operations. Baselines calibrated. Custom playbooks tested and deployed. On-call team focused on strategic work, not firefighting.
Result
Cost: $2K/month SentienGuard + $0 Grafana = $2K/month total
MTTR: <90 seconds from day 1
Savings: No legacy monitoring bills to compare against
Deploy Your First Agent in 2 Minutes
Free for 3 nodes. No credit card. First autonomous resolution in 8 minutes.
Install Agent
Linux
curl -sSL https://get.sentienguard.com/install | bash
Kubernetes (Helm)
helm repo add sentienguard \
  https://charts.sentienguard.com
helm install sentienguard \
  sentienguard/agent \
  --set apiKey=$SENTIENGUARD_API_KEY
Docker
docker run -d \
--name sentienguard-agent \
-e SENTIENGUARD_API_KEY=$API_KEY \
-v /var/run/docker.sock:/var/run/docker.sock \
sentienguard/agent:latest
Time: 2 minutes
Verify Connection
Check agent status
# Check agent status
systemctl status sentienguard-agent
# View logs
tail -f /var/log/sentienguard/agent.log
Expected Output
[INFO] Connected to control plane
[INFO] Heartbeat sent (30s interval)
[INFO] Baseline learning started
[INFO] 6 metrics streaming
Time: 30 seconds
Import Playbooks
Via Dashboard
- Navigate to: Playbooks → Import
- Select: disk_cleanup, memory_restart, k8s_pod_restart
- Click: Import All
Via CLI
sentienguard playbook import disk_cleanup_linux
sentienguard playbook import postgres_connection_reset
sentienguard playbook import ssl_cert_renewal
Time: 5 minutes
Total: 8 minutes to first autonomous resolution.
Ready to See It in Action?
Deploy free on 3 nodes. Trigger a test incident. Watch autonomous resolution.
Review the audit log. All in under 10 minutes.
Free tier: 3 nodes, unlimited playbooks, full audit logs, no credit card required. Upgrade anytime to scale beyond 3 nodes.