HARP Ecosystem Diagram — MCP, LiteLLM, Prometheus, Telegram, OpenRAG, Nero Camp

What problem it solves

Before HARP, every incident forced context reconstruction.

The useful knowledge already existed across runbooks, RCAs, session logs, Prometheus rules, Gitea/Nero Camp runner notes, and OpenRAG troubleshooting. HARP makes that memory queryable, cited, and safe for AI agents to consume.

Before

Copy-paste triage

Humans manually paste alert text, runbook fragments, logs, and past fixes into separate AI chats.

Now

Source-grounded query

Agents call HARP through MCP or HTTP and receive file-path citations instead of free-floating advice.

Boundary

Evidence before action

HARP can diagnose and recommend next diagnostics. State-changing remediation stays behind human approval.

Retrieval and alert flow

The six-step evidence pipeline.

BM25 always runs first because it is fast, deterministic, and outage-friendly. OpenRAG is optional semantic enrichment. Hybrid results are merged, cited, and returned to humans or agents.
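A minimal sketch of the "BM25 first, OpenRAG optional" ordering, assuming each backend is a plain callable that takes a query and returns a ranked list of file paths. The function name, the callable signature, and the result keys are hypothetical and are not taken from the HARP codebase.

```python
from typing import Callable, List, Optional

# Hypothetical backend signature: query in, ranked list of file paths out.
SearchFn = Callable[[str], List[str]]


def search_with_fallback(
    query: str,
    bm25: SearchFn,
    semantic: Optional[SearchFn] = None,
) -> dict:
    """Run BM25 first; treat semantic search as optional enrichment."""
    result = {"mode": "bm25-only", "bm25": bm25(query), "semantic": []}
    if semantic is None:
        return result
    try:
        result["semantic"] = semantic(query)
        result["mode"] = "hybrid"
    except OSError as exc:
        # Connection and timeout errors are OSError subclasses; the BM25
        # results gathered above are still returned.
        result["semantic_error"] = str(exc)
    return result
```

Because the BM25 call completes before the semantic backend is contacted, an unreachable semantic backend only changes the reported mode; it does not remove results.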

01

Signal

Prometheus alert, Telegram message, human question, CI failure, or Nero Camp service probe.

02

HARP entry

FastAPI endpoint, CLI, runbook explorer, or MCP tool call enters the same reliability memory.

03

BM25 search

Exact alert names, hostnames, error strings, paths, and commands are matched locally.

04

OpenRAG search

When reachable, semantic search catches symptom language and similar incident patterns.

05

RRF merge

Hybrid ranking combines BM25 and semantic results with source attribution.

06

Evidence pack

Citations, confidence, live telemetry, safe diagnostics, and approval status return.
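Step 05 can be illustrated with standard Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. This is a sketch under assumptions: the function name is hypothetical, k = 60 is the conventional RRF constant rather than a value confirmed for HARP, and the input lists are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_merge(
    rankings: Dict[str, List[str]], k: int = 60
) -> List[Tuple[str, float, List[str]]]:
    """Fuse ranked lists; rank starts at 1 within each list."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, docs in rankings.items():
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += 1.0 / (k + rank)
            sources[doc].append(source)  # keep attribution per document
    # Highest score first; ties broken by path so output is deterministic.
    return sorted(
        ((doc, scores[doc], sources[doc]) for doc in scores),
        key=lambda item: (-item[1], item[0]),
    )


merged = rrf_merge({
    "bm25": ["rca-010-node-disk-pressure.md", "runbook-b.md"],
    "openrag": ["rca-010-node-disk-pressure.md", "runbook-c.md"],
})
for path, score, found_by in merged:
    print(f"{score:.4f}  {path}  via {', '.join(found_by)}")
```

A document found by both backends outranks one found by either alone, and the per-document source list carries the attribution forward into the evidence pack.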

Systems in the loop

Each component has one job.

HARP core

Reliability memory

Indexes docs, renders evidence packs, audits metadata, exposes API and MCP, and preserves a BM25-only outage mode.

Paths: harp/core, harp/api, harp/mcp

Surface: /search, /search-hybrid, /mcp

OpenRAG

Semantic recall

Stores synchronized HARP content in a semantic backend so symptom-style queries find related runbooks and RCAs.

Host: VM .63

Fallback: BM25-only when unavailable

MCP

Safe agent door

Exposes read-only tools like harp_search, harp_search_hybrid, harp_triage_alert, harp_stats, and harp_health, plus bounded resources and prompts.

Transport: stdio or streamable HTTP

Rule: no write tools
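The "no write tools" rule can be pictured as an allowlist checked before any handler runs. The sketch below is plain Python and deliberately does not use an MCP SDK. The tool names come from the list above, which may not be exhaustive, while `dispatch`, `ToolNotAllowed`, and the handler mapping are hypothetical.

```python
from typing import Any, Callable, Dict

# Tool names listed in this document; the real registry may contain more.
READ_ONLY_TOOLS = frozenset({
    "harp_search",
    "harp_search_hybrid",
    "harp_triage_alert",
    "harp_stats",
    "harp_health",
})


class ToolNotAllowed(Exception):
    """Raised when a caller asks for a tool outside the read-only set."""


def dispatch(tool_name: str, handlers: Dict[str, Callable[..., Any]], **kwargs: Any) -> Any:
    """Run a handler only if the tool is on the read-only allowlist."""
    if tool_name not in READ_ONLY_TOOLS:
        raise ToolNotAllowed(f"{tool_name} is not an exposed read-only tool")
    return handlers[tool_name](**kwargs)
```

Checking the allowlist before the handler lookup means a write-capable handler registered by mistake still cannot be reached through this door.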

Prometheus

Signal and telemetry

Fires alerts, exposes live metrics, and gives HARP evidence packs a current view of the cluster or service.

Pair: AlertManager

Output: graph links, alert state, PromQL

Human channel

Receives the primary AlertManager notification and, where enabled, a second HARP summary with citations and checklist.

Always: alert must arrive first

Later: approval trigger, Git is record

LiteLLM

Optional classifier

The Phase 1 AI remediation agent can send redacted diagnostic output to LiteLLM/Bedrock only when ALLOW_CLOUD_LLM=true.

Default: disabled

Mode: classification, not execution
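A minimal sketch of the gate described above, assuming the flag is read from the process environment and that redaction happens before any text leaves the host. The redaction patterns are illustrative only, and `classify` is a hypothetical callable standing in for whatever client talks to LiteLLM; no LiteLLM API is called here.

```python
import os
import re
from typing import Callable, Mapping, Optional

# Illustrative patterns; a real redaction list would be broader.
_REDACTIONS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    (re.compile(r"(?i)\b(password|token|secret)=\S+"), r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each redaction pattern in order."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def classify_if_allowed(
    diagnostics: str,
    classify: Callable[[str], str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return a classification label, or None when the cloud LLM is disabled."""
    if env.get("ALLOW_CLOUD_LLM") != "true":
        return None  # default: disabled, nothing is sent
    return classify(redact(diagnostics))
```

The function only returns a label. Nothing in this path executes a remediation, which matches the "classification, not execution" mode.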

Nero Camp

Adjacent service and CI context

Nero Camp is not in HARP's control path. It appears as an app, monitored service, Gitea repo, and runner context that HARP can explain through probes and CI RCAs.

Service: nero.climacs.net

Runner: nero-camp-runner on .61

Git and Gitea

System of record

Runbooks, RCAs, guardrails, approval records, and learning updates live in Git so the audit trail outlives chat messages.

Source: docs, planning, profiles

CI: Forgejo/Gitea runners

Others

Existing tools stay sovereign

Grafana, AlertManager, Caddy, Kubernetes, n8n, Argo CD, and future Rundeck remain their own systems. HARP integrates; it does not replace them.

Execute: future Rundeck lane

Access: Caddy and cluster ingress

Two operating lanes

Core stays boring. Assist stays optional.

HARP Core: keep and harden

Search: BM25, hybrid retrieval, cited results.
Evidence: Read-only renderer, confidence, citations, safe next diagnostics.
MCP: Agent-facing read-only tools, resources, prompts, and path-bounded document reads.
Outage mode: CLI and local Git clone still work when APIs are down.
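The "path-bounded document reads" item in the Core lane can be sketched as a resolve-then-check guard. This is an assumption about how such a bound might be enforced; `read_bounded`, the size cap, and the error type are hypothetical.

```python
from pathlib import Path


def read_bounded(root: Path, relative_path: str, max_bytes: int = 65536) -> str:
    """Read a file only if its resolved path stays under root."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    target = (root / relative_path).resolve()
    if root not in target.parents:
        raise PermissionError(f"{relative_path} escapes the document root")
    with target.open("rb") as handle:
        # Cap the read so a single call cannot return an unbounded payload.
        return handle.read(max_bytes).decode("utf-8", errors="replace")
```

Resolving before comparing is the important step: a request such as `../secret.txt`, or an absolute path, ends up outside `root` after resolution and is rejected before any file is opened.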

HARP Assist: optional, measured

Alerts: Prometheus and AlertManager can feed enrichment workflows.
Telemetry: Prometheus queries are time-bounded and degrade softly.
LLM: LiteLLM classification is gated and redacted.
Notify: Telegram reports summarize, but Git records approvals.
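The telemetry rule in the Assist lane (time-bounded queries that degrade softly) could look like the sketch below. It uses the Prometheus HTTP API instant-query endpoint, `/api/v1/query`. The function names, the two-second default, and the injectable `fetch` parameter are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional


def _http_get(url: str, timeout_s: float) -> bytes:
    """Fetch a URL with a hard timeout."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read()


def instant_query(
    base_url: str,
    promql: str,
    timeout_s: float = 2.0,
    fetch: Callable[[str, float], bytes] = _http_get,
) -> Optional[list]:
    """Return the result list, or None if Prometheus is slow or unavailable."""
    url = f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    try:
        payload = json.loads(fetch(url, timeout_s))
    except (OSError, ValueError):
        # Timeouts and connection errors are OSError; bad JSON is ValueError.
        return None
    if not isinstance(payload, dict) or payload.get("status") != "success":
        return None
    return payload.get("data", {}).get("result")
```

A caller can treat `None` as "no live telemetry available" and still return the cited evidence, so a Prometheus outage reduces the pack rather than blocking it.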

Interactive walkthrough

Walk through a real HARP-style incident path.

This is a compact transcript model of the flow: an alert arrives, an agent connects through MCP, HARP searches BM25 and OpenRAG, live telemetry is attached, and the human receives a bounded recommendation.

Evidence pack preview: read-only render

Alert: DiskAlmostFull on web-climacs-test. HARP returns citations, live context, and next diagnostics. It does not clean disk, restart services, or edit cluster state.

docs/troubleshooting/rca/rca-010-node-disk-pressure.md

docs/troubleshooting/session-log-2026-05-03-ai-remediation-agent.md

planning/guardrails/safe-action-registry.yaml

What the human sees: approval required

Confidence: medium-high. Suggested diagnostics are read-only. Any cleanup, restart, scaling, GitOps change, or cloud-cost-impacting move is outside MCP and requires a recorded approval.

Allowed now: describe, query, summarize, cite

Blocked now: shell, kubectl mutate, Git write, reindex

Future lane: Rundeck job after approval
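One way to picture the evidence pack shown in this walkthrough is as a small immutable record with a text renderer. The class name, field names, and the sample diagnostic are hypothetical; only the alert text, confidence label, and citation path are taken from the preview above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EvidencePack:
    """Read-only bundle returned to a human or an agent."""
    alert: str
    confidence: str
    citations: List[str]
    diagnostics: List[str] = field(default_factory=list)
    approval_required: bool = True  # any state change needs a recorded approval

    def render(self) -> str:
        lines = [f"Alert: {self.alert}", f"Confidence: {self.confidence}", "Citations:"]
        lines += [f"  - {path}" for path in self.citations]
        lines.append("Suggested read-only diagnostics:")
        lines += [f"  - {item}" for item in self.diagnostics]
        answer = "yes" if self.approval_required else "no"
        lines.append(f"Approval required for any change: {answer}")
        return "\n".join(lines)


pack = EvidencePack(
    alert="DiskAlmostFull on web-climacs-test",
    confidence="medium-high",
    citations=["docs/troubleshooting/rca/rca-010-node-disk-pressure.md"],
    diagnostics=["check filesystem usage on the node"],  # illustrative entry
)
print(pack.render())
```

The record carries no method that changes cluster state, which keeps the renderer on the "allowed now" side of the boundary.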

Guardrails

The system is designed to stop at the right line.

The operating boundary is deliberate: HARP is primarily knowledge and triage. Execution is a separate future lane, and MCP never exposes write tools.

R

Read-only diagnostics: search, stats, health, logs, metrics, citations.

MCP

No shell, no kubectl mutation, no Git writes, no reindex, no remediation tools.

P/X

Patch or mutate suggestions require human review and a recorded approval.

LLM

Cloud LLM classification is explicit opt-in with diagnostic redaction.

D

Destructive or irreversible actions are forbidden by policy.

Fallback

OpenRAG, Prometheus, MCP HTTP, or LiteLLM can fail without killing HARP Core.
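The action classes above can be sketched as a small policy function. The enum and the decision strings are hypothetical, and reading "P/X" as two separate classes (patch and mutate) is an assumption. An approval here only records that a human signed off; execution would still happen in the separate future lane, not in HARP.

```python
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    READ = "R"         # read-only diagnostics
    PATCH = "P"        # suggested patch
    MUTATE = "X"       # suggested state change
    DESTRUCTIVE = "D"  # destructive or irreversible


def decide(action_class: ActionClass, approval_ref: Optional[str] = None) -> str:
    """Map an action class to a policy decision."""
    if action_class is ActionClass.READ:
        return "allow"
    if action_class is ActionClass.DESTRUCTIVE:
        return "forbid"  # an approval reference does not override this
    return "approved" if approval_ref else "needs-approval"
```

Usage, with a made-up approval reference:

```python
print(decide(ActionClass.READ))
print(decide(ActionClass.MUTATE))
print(decide(ActionClass.MUTATE, approval_ref="approvals/example.md"))
print(decide(ActionClass.DESTRUCTIVE, approval_ref="approvals/example.md"))
```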

Source notes

Built from the repo, not guesswork.

Architecture sources

harp/README.md, planning/architecture/system-overview.md, docs/planning/harp-operating-boundary-review-2026-05-04.md, harp/mcp/tools.py, and harp/api/server.py.

Assist-lane sources

planning/specs/003-alert-routing-and-enrichment.md, planning/specs/006-ai-alert-auto-remediation.md, scripts/ai-remediation-agent/, Nero Camp runner notes, and Prometheus/Telegram session logs.
