HARP Ecosystem
Homelab and Hybrid AI Reliability Platform

How HARP turns alerts into cited operational memory.

HARP is the source-grounded reliability layer between monitoring, AI agents, knowledge systems, and human approval. MCP gives agents a safe read-only door. OpenRAG improves semantic retrieval. LiteLLM classifies diagnostics only when explicitly allowed. Prometheus and Telegram keep the human in the loop.

  • MCP read-only surface
  • BM25 plus OpenRAG hybrid search
  • Prometheus evidence enrichment
  • Telegram follow-up reporting
  • Nero Camp and Gitea runner context
HMCP stage
0.5.0
6 schemas
cited docs
RRF hybrid merge
fallback
Control-room view: read-only first
MCP clients: Codex, Cursor, Claude Desktop, Cline
Prometheus: rules, scrape health, live telemetry
LiteLLM: optional classifier via Bedrock/Mistral
HARP: search, evidence, API, MCP, guardrails
Telegram: primary alert plus HARP follow-up
OpenRAG: semantic index on VM .63, BM25 fallback
Nero Camp and CI: service probe, runner context, RCAs
Git-tracked knowledge: runbooks, RCAs, sessions, profiles
agent context, monitoring signal, human gate
What problem it solves

Before HARP, every incident forced context reconstruction.

The useful knowledge already existed across runbooks, RCAs, session logs, Prometheus rules, Gitea/Nero Camp runner notes, and OpenRAG troubleshooting. HARP makes that memory queryable, cited, and safe for AI agents to consume.

Before

Copy-paste triage

Humans manually paste alert text, runbook fragments, logs, and past fixes into separate AI chats.

Now

Source-grounded query

Agents call HARP through MCP or HTTP and receive file-path citations instead of free-floating advice.

Boundary

Evidence before action

HARP can diagnose and recommend next diagnostics. State-changing remediation stays behind human approval.

Retrieval and alert flow

The six-step evidence pipeline.

BM25 always runs first because it is fast, deterministic, and outage-friendly. OpenRAG is optional semantic enrichment. Hybrid results are merged, cited, and returned to humans or agents.
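
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not HARP's actual code: `hybrid_search`, the `Hit` tuple, and the two injected search callables are hypothetical names, and the returned dictionary shape is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# (file-path citation, score); a hypothetical shape for a search hit.
Hit = Tuple[str, float]


def hybrid_search(
    query: str,
    bm25_search: Callable[[str], List[Hit]],
    semantic_search: Optional[Callable[[str], List[Hit]]] = None,
) -> dict:
    """Run BM25 first; treat semantic search as optional enrichment."""
    # BM25 is local and deterministic, so it always runs.
    bm25_hits = bm25_search(query)
    semantic_hits: List[Hit] = []
    mode = "bm25-only"
    if semantic_search is not None:
        try:
            semantic_hits = semantic_search(query)
            mode = "hybrid"
        except Exception:
            # Semantic backend unreachable: degrade instead of failing.
            semantic_hits = []
            mode = "bm25-only"
    return {"mode": mode, "bm25": bm25_hits, "semantic": semantic_hits}
```

The point of the shape is that a semantic outage changes the reported mode but never prevents the BM25 hits from being returned.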

01
Signal

Prometheus alert, Telegram message, human question, CI failure, or Nero Camp service probe.

02
HARP entry

FastAPI endpoint, CLI, runbook explorer, or MCP tool call enters the same reliability memory.

03
BM25 search

Exact alert names, hostnames, error strings, paths, and commands are matched locally.

04
OpenRAG search

When reachable, semantic search catches symptom language and similar incident patterns.

05
RRF merge

Hybrid ranking combines BM25 and semantic results with source attribution.

06
Evidence pack

Citations, confidence, live telemetry, safe diagnostics, and approval status return.
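
Step 05 names RRF, which normally stands for reciprocal rank fusion. A minimal sketch of that technique is below, assuming each backend returns an ordered list of document paths. The function name `rrf_merge`, the output fields, and the constant `k = 60` (a common default for RRF) are assumptions, not values taken from the HARP repo.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(rankings: Dict[str, List[str]], k: int = 60) -> List[dict]:
    """Fuse ranked lists of document paths with reciprocal rank fusion.

    `rankings` maps a source name (for example "bm25") to its ranked paths,
    best first. Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so agreement between backends pushes a document up.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ranked_paths in rankings.items():
        for rank, path in enumerate(ranked_paths, start=1):
            scores[path] += 1.0 / (k + rank)
            sources[path].append(source)  # keep source attribution per hit
    # Sort by descending score; break ties by path for deterministic output.
    ordered = sorted(scores, key=lambda p: (-scores[p], p))
    return [{"path": p, "score": scores[p], "sources": sources[p]} for p in ordered]
```

With a BM25 list of `a.md, b.md` and a semantic list of `b.md, c.md`, `b.md` ranks first because both backends returned it, and its entry lists both sources.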

Systems in the loop

Each component has one job.

HARP core

Reliability memory

Indexes docs, renders evidence packs, audits metadata, exposes API and MCP, and preserves a BM25-only outage mode.

Paths: harp/core, harp/api, harp/mcp
Surface: /search, /search-hybrid, /mcp
OpenRAG

Semantic recall

Stores synchronized HARP content in a semantic backend so symptom-style queries find related runbooks and RCAs.

Host: VM .63
Fallback: BM25-only when unavailable
MCP

Safe agent door

Exposes read-only tools like harp_search, harp_search_hybrid, harp_triage_alert, harp_stats, and harp_health, plus bounded resources and prompts.

Transport: stdio or streamable HTTP
Rule: no write tools
Prometheus

Signal and telemetry

Fires alerts, exposes live metrics, and gives HARP evidence packs a current view of the cluster or service.

Pair: AlertManager
Output: graph links, alert state, PromQL
Telegram

Human channel

Receives the primary AlertManager notification and, where enabled, a second HARP summary with citations and checklist.

Always: alert must arrive first
Later: approval trigger, Git is record
LiteLLM

Optional classifier

The Phase 1 AI remediation agent can send redacted diagnostic output to LiteLLM/Bedrock only when ALLOW_CLOUD_LLM=true.

Default: disabled
Mode: classification, not execution
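
A gate of this kind might look like the sketch below. Only the `ALLOW_CLOUD_LLM` variable comes from the text above. The function names, the redaction patterns, and the choice to return `None` when the gate is closed are illustrative assumptions, and a real redaction list would need to be broader.

```python
import os
import re
from typing import Mapping, Optional

# Hypothetical redaction patterns: IPv4-looking strings and key=value secrets.
_REDACTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[REDACTED_IP]"),
    (
        re.compile(r"(?i)\b(password|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
]


def cloud_llm_allowed(env: Mapping[str, str] = os.environ) -> bool:
    """Closed by default: only the literal value 'true' opens the gate."""
    return env.get("ALLOW_CLOUD_LLM", "").strip().lower() == "true"


def redact(text: str) -> str:
    """Apply every redaction pattern in order."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def prepare_for_classifier(
    diagnostic_output: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return redacted text to send for classification, or None if not allowed."""
    if not cloud_llm_allowed(env):
        return None
    return redact(diagnostic_output)
```

Returning `None` rather than raising keeps the caller's default path (no cloud call) the easy one to write.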
Nero Camp

Adjacent service and CI context

Nero Camp is not in HARP's control path. It appears as an app, monitored service, Gitea repo, and runner context that HARP can explain through probes and CI RCAs.

Service: nero.climacs.net
Runner: nero-camp-runner on .61
Git and Gitea

System of record

Runbooks, RCAs, guardrails, approval records, and learning updates live in Git so the audit trail outlives chat messages.

Source: docs, planning, profiles
CI: Forgejo/Gitea runners
Others

Existing tools stay sovereign

Grafana, AlertManager, Caddy, Kubernetes, n8n, Argo CD, and future Rundeck remain their own systems. HARP integrates; it does not replace them.

Execute: future Rundeck lane
Access: Caddy and cluster ingress
Two operating lanes

Core stays boring. Assist stays optional.

HARP Core: keep and harden

  • Search: BM25, hybrid retrieval, cited results.
  • Evidence: Read-only renderer, confidence, citations, safe next diagnostics.
  • MCP: Agent-facing read-only tools, resources, prompts, and path-bounded document reads.
  • Outage mode: CLI and local Git clone still work when APIs are down.
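
The path-bounded document reads mentioned in the MCP item can be illustrated with a small check like the one below. It is a sketch under assumptions: `resolve_bounded` and the idea of a single knowledge root directory are hypothetical, not names from `harp/mcp`.

```python
from pathlib import Path


def resolve_bounded(root: Path, requested: str) -> Path:
    """Resolve a requested document path, refusing anything outside root."""
    root = root.resolve()
    # Resolving collapses ".." segments and symlinks before the check;
    # an absolute `requested` replaces root entirely and is then rejected.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes knowledge root: {requested}")
    return candidate
```

A request such as `docs/a.md` resolves under the root, while `../outside.md` or `/etc/passwd` raises `PermissionError` before any file is opened.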

HARP Assist: optional, measured

  • Alerts: Prometheus and AlertManager can feed enrichment workflows.
  • Telemetry: Prometheus queries are time-bounded and degrade softly.
  • LLM: LiteLLM classification is gated and redacted.
  • Notify: Telegram reports summarize, but Git records approvals.
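
One way to read "time-bounded and degrade softly" in the Telemetry item is a short request timeout plus a structured "unavailable" result instead of an exception. The sketch below takes that reading, which is an assumption. It uses the standard Prometheus instant-query endpoint `/api/v1/query`; the function names, the 3-second default, and the result dictionary are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable


def _http_get(url: str, timeout: float) -> bytes:
    """Default fetcher: plain GET with a hard timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def query_prometheus(
    base_url: str,
    promql: str,
    timeout: float = 3.0,
    http_get: Callable[[str, float], bytes] = _http_get,
) -> dict:
    """Run an instant query; never raise, so the evidence pack still renders."""
    url = f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode(
        {"query": promql}
    )
    try:
        payload = json.loads(http_get(url, timeout))
        if payload.get("status") != "success":
            raise ValueError(payload.get("error", "query failed"))
        return {"available": True, "result": payload["data"]["result"], "note": None}
    except Exception as exc:
        # Soft degradation: report the gap rather than failing the whole request.
        return {"available": False, "result": [], "note": f"telemetry unavailable: {exc}"}
```

The fetcher is injected so the degradation path can be exercised without a live Prometheus server.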
Interactive walkthrough

Click through a real HARP-style incident path.

This is a compact transcript model of the flow: an alert arrives, an agent connects through MCP, HARP searches BM25 and OpenRAG, live telemetry is attached, and the human receives a bounded recommendation.


      

Evidence pack preview: read-only render

Alert: DiskAlmostFull on web-climacs-test. HARP returns citations, live context, and next diagnostics. It does not clean disk, restart services, or edit cluster state.

docs/troubleshooting/rca/rca-010-node-disk-pressure.md
docs/troubleshooting/session-log-2026-05-03-ai-remediation-agent.md
planning/guardrails/safe-action-registry.yaml

What the human sees: approval required

Confidence: medium-high. Suggested diagnostics are read-only. Any cleanup, restart, scaling, GitOps change, or cloud-cost-impacting move is outside MCP and requires a recorded approval.

Allowed now: describe, query, summarize, cite
Blocked now: shell, kubectl mutate, Git write, reindex
Future lane: Rundeck job after approval
Guardrails

The system is designed to stop at the right line.

The operating boundary is deliberate: HARP is primarily knowledge and triage. Execution is a separate future lane, and MCP never exposes write tools.

R

Read-only diagnostics: search, stats, health, logs, metrics, citations.

MCP

No shell, no kubectl mutation, no Git writes, no reindex, no remediation tools.

P/X

Patch or mutate suggestions require human review and a recorded approval.

LLM

Cloud LLM classification is explicit opt-in with diagnostic redaction.

D

Destructive or irreversible actions are forbidden by policy.

Fallback

OpenRAG, Prometheus, MCP HTTP, or LiteLLM can fail without killing HARP Core.
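
The action classes above can be expressed as a small policy function. This is a sketch of the stated rules only: the enum member names, the reading of P as patch and X as mutate, the `decide` function, and its return strings are all assumptions rather than the contents of planning/guardrails/safe-action-registry.yaml.

```python
from enum import Enum


class ActionClass(Enum):
    READ = "R"         # read-only diagnostics
    PATCH = "P"        # patch suggestions (assumed meaning of P)
    MUTATE = "X"       # mutating actions (assumed meaning of X)
    DESTRUCTIVE = "D"  # destructive or irreversible actions


def decide(action_class: ActionClass, via_mcp: bool, approved: bool = False) -> str:
    """Map an action class to a policy decision."""
    if action_class is ActionClass.DESTRUCTIVE:
        return "forbidden"  # never allowed, even with approval
    if action_class is ActionClass.READ:
        return "allowed"
    # Patch or mutate: never exposed through MCP, whatever the approval state.
    if via_mcp:
        return "blocked"
    # Outside MCP, these wait for a recorded human approval.
    return "approved" if approved else "needs-approval"
```

Checking the destructive class first means no combination of the other flags can unlock it.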

Source notes

Built from the repo, not guesswork.

Architecture sources

harp/README.md, planning/architecture/system-overview.md, docs/planning/harp-operating-boundary-review-2026-05-04.md, harp/mcp/tools.py, and harp/api/server.py.

Assist-lane sources

planning/specs/003-alert-routing-and-enrichment.md, planning/specs/006-ai-alert-auto-remediation.md, scripts/ai-remediation-agent/, Nero Camp runner notes, and Prometheus/Telegram session logs.