Before HARP, every incident forced context reconstruction.
The useful knowledge already existed across runbooks, RCAs, session logs, Prometheus rules, Gitea/Nero Camp runner notes, and OpenRAG troubleshooting. HARP makes that memory queryable, cited, and safe for AI agents to consume.
Copy-paste triage
Humans manually paste alert text, runbook fragments, logs, and past fixes into separate AI chats.
Source-grounded query
Agents call HARP through MCP or HTTP and receive file-path citations instead of free-floating advice.
Evidence before action
HARP can diagnose and recommend next diagnostics. State-changing remediation stays behind human approval.
The six-step evidence pipeline.
BM25 always runs first because it is fast, deterministic, and outage-friendly. OpenRAG is optional semantic enrichment. Hybrid results are merged, cited, and returned to humans or agents.
Prometheus alert, Telegram message, human question, CI failure, or Nero Camp service probe.
FastAPI endpoint, CLI, runbook explorer, or MCP tool call enters the same reliability memory.
Exact alert names, hostnames, error strings, paths, and commands are matched locally.
When reachable, semantic search catches symptom language and similar incident patterns.
Hybrid ranking combines BM25 and semantic results with source attribution.
HARP returns citations, confidence, live telemetry, safe diagnostics, and approval status.
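The merge step above can be sketched roughly as follows. This is a minimal illustration, not HARP's actual ranking code: the source does not say how scores are combined, so reciprocal rank fusion is swapped in here as an assumption, and the function name `hybrid_search`, the `k` constant, and the example file paths are all hypothetical. What it does reflect from the text is the ordering: BM25 always runs, the semantic backend is optional, and every result keeps its source attribution.

```python
from typing import Callable, Optional

Hit = tuple[str, float]  # (source file path, backend-specific score)


def hybrid_search(
    query: str,
    bm25_search: Callable[[str], list[Hit]],
    semantic_search: Optional[Callable[[str], list[Hit]]] = None,
    k: int = 60,      # assumed RRF damping constant, not taken from HARP
    limit: int = 5,
) -> dict:
    """Run BM25 first, add semantic hits when reachable, merge by rank."""
    bm25_hits = bm25_search(query)  # local and deterministic, always runs
    semantic_hits: list[Hit] = []
    mode = "bm25-only"
    if semantic_search is not None:
        try:
            semantic_hits = semantic_search(query)
            mode = "hybrid"
        except Exception:
            # Semantic backend unreachable: keep the BM25 results and move on.
            semantic_hits = []

    fused: dict[str, float] = {}
    sources: dict[str, list[str]] = {}
    for name, hits in (("bm25", bm25_hits), ("semantic", semantic_hits)):
        for rank, (path, _score) in enumerate(hits, start=1):
            # Reciprocal rank fusion: only the rank matters, not the raw score.
            fused[path] = fused.get(path, 0.0) + 1.0 / (k + rank)
            sources.setdefault(path, []).append(name)

    ranked = sorted(fused, key=lambda p: (-fused[p], p))[:limit]
    return {
        "mode": mode,
        "results": [
            {"path": p, "score": fused[p], "sources": sources[p]} for p in ranked
        ],
    }
```

Rank-based fusion is used in this sketch because BM25 and embedding scores live on different scales. A document found by both backends rises to the top, and a failed semantic call simply yields a BM25-only answer.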
Each component has one job.
Reliability memory
Indexes docs, renders evidence packs, audits metadata, exposes API and MCP, and preserves a BM25-only outage mode.
Semantic recall
Stores synchronized HARP content in a semantic backend so symptom-style queries find related runbooks and RCAs.
Safe agent door
Exposes read-only tools like harp_search, harp_search_hybrid, harp_triage_alert, harp_stats, and harp_health, plus bounded resources and prompts.
Signal and telemetry
Fires alerts, exposes live metrics, and gives HARP evidence packs a current view of the cluster or service.
Human channel
Receives the primary AlertManager notification and, where enabled, a second HARP summary with citations and checklist.
Optional classifier
The Phase 1 AI remediation agent can send redacted diagnostic output to LiteLLM/Bedrock only when ALLOW_CLOUD_LLM=true.
Adjacent service and CI context
Nero Camp is not in HARP's control path. It appears as an app, monitored service, Gitea repo, and runner context that HARP can explain through probes and CI RCAs.
System of record
Runbooks, RCAs, guardrails, approval records, and learning updates live in Git so the audit trail outlives chat messages.
Existing tools stay sovereign
Grafana, AlertManager, Caddy, Kubernetes, n8n, Argo CD, and future Rundeck remain their own systems. HARP integrates; it does not replace them.
Core stays boring. Assist stays optional.
HARP Core: keep and harden
- Search: BM25, hybrid retrieval, cited results.
- Evidence: Read-only renderer, confidence, citations, safe next diagnostics.
- MCP: Agent-facing read-only tools, resources, prompts, and path-bounded document reads.
- Outage mode: CLI and local Git clone still work when APIs are down.
HARP Assist: optional, measured
- Alerts: Prometheus and AlertManager can feed enrichment workflows.
- Telemetry: Prometheus queries are time-bounded and degrade softly.
- LLM: LiteLLM classification is gated and redacted.
- Notify: Telegram reports summarize, but Git records approvals.
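As an illustration of what a path-bounded document read in the Core list could look like, here is a minimal sketch. It is an assumption about the general technique, not a copy of harp/mcp/tools.py: the function name `read_bounded`, the `max_bytes` cap, and its default value are hypothetical.

```python
from pathlib import Path


def read_bounded(root: Path, requested: str, max_bytes: int = 64_000) -> str:
    """Read a document only if it resolves to a location inside `root`."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise PermissionError(f"path escapes document root: {requested}")
    if not target.is_file():
        raise FileNotFoundError(requested)
    with target.open("rb") as handle:
        data = handle.read(max_bytes)  # cap the size handed back to an agent
    return data.decode("utf-8", errors="replace")
```

Resolving before comparing is the important part: a request such as `../secret.txt` or an absolute path is rejected because the final location, not the string the agent supplied, is what gets checked against the root.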
Walk through a real HARP-style incident path.
This is a compact transcript model of the flow: an alert arrives, an agent connects through MCP, HARP searches BM25 and OpenRAG, live telemetry is attached, and the human receives a bounded recommendation.
Evidence pack preview (read-only render)
Alert: DiskAlmostFull on web-climacs-test. HARP returns citations, live context, and next diagnostics. It does not clean disk, restart services, or edit cluster state.
What the human sees (approval required)
Confidence: medium-high. Suggested diagnostics are read-only. Any cleanup, restart, scaling, GitOps change, or cloud-cost-impacting move is outside MCP and requires a recorded approval.
The system is designed to stop at the right line.
The operating boundary is deliberate: HARP is primarily knowledge and triage. Execution is a separate future lane, and MCP never exposes write tools.
Read-only diagnostics: search, stats, health, logs, metrics, citations.
No shell, no kubectl mutation, no Git writes, no reindex, no remediation tools.
Patch or mutate suggestions require human review and a recorded approval.
Cloud LLM classification is explicit opt-in with diagnostic redaction.
Destructive or irreversible actions are forbidden by policy.
OpenRAG, Prometheus, MCP HTTP, or LiteLLM can fail without killing HARP Core.
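The opt-in rule for cloud LLM classification can be sketched as a small gate. Only the `ALLOW_CLOUD_LLM=true` flag comes from the text above. The function names and the redaction patterns are hypothetical placeholders, since the source does not list what the real redaction step removes.

```python
import os
import re
from typing import Mapping, Optional

# Assumed example patterns; the real redaction rules are not given in the source.
_REDACTIONS = [
    (
        re.compile(r"(?i)\b(password|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
]


def cloud_llm_allowed(env: Optional[Mapping[str, str]] = None) -> bool:
    """Cloud classification is off unless ALLOW_CLOUD_LLM is exactly 'true'."""
    env = os.environ if env is None else env
    return env.get("ALLOW_CLOUD_LLM", "").strip() == "true"


def redact(text: str) -> str:
    """Apply every redaction pattern in order."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def prepare_for_classifier(
    diagnostic_output: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return redacted text to send, or None when the opt-in flag is not set."""
    if not cloud_llm_allowed(env):
        return None  # default closed: nothing leaves the environment
    return redact(diagnostic_output)
```

The gate defaults to closed, so a missing or misspelled flag means no diagnostic output is sent, and redaction runs on everything that does pass.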
Built from the repo, not guesswork.
Architecture sources
harp/README.md, planning/architecture/system-overview.md, docs/planning/harp-operating-boundary-review-2026-05-04.md, harp/mcp/tools.py, and harp/api/server.py.
Assist-lane sources
planning/specs/003-alert-routing-and-enrichment.md, planning/specs/006-ai-alert-auto-remediation.md, scripts/ai-remediation-agent/, Nero Camp runner notes, and Prometheus/Telegram session logs.