Infrastructure, container topology, AI pipeline, security posture, and hardware specifications for IT architects, platform engineers, and technical decision-makers.
This is the actual architecture running on a self-hosted Proxmox VM — not a theoretical diagram. Every component listed below is deployed and operational.
- **User**: accesses the Chainlit chat UI or the API directly at https://<your-host>:8080/chainlit
- **openrag-cpu**: main API + Chainlit + Ray actor pool + document serializer (FastAPI, Ray, Whisper); ~4.7 GB RAM
- **indexer-ui**: React file-upload dashboard; :8060; ~31 MB RAM
- **vllm-cpu (embedder)**: BAAI/bge-small-en-v1.5, 33M params, 384-dim embeddings; ~1.1 GB RAM
- **litellm-proxy**: routes OpenAI-compatible calls → AWS Bedrock; :4000, started with --drop_params; ~800 MB RAM
- **reranker-cpu**: gte-multilingual-reranker-base; ~2.1 GB RAM; enable for 50+ docs
- **milvus**: vector database storing document embeddings; ~190 MB RAM
- **minio + etcd**: object storage + metadata for Milvus; ~163 MB RAM combined
- **rdb (Postgres)**: internal metadata and partition config; ~43 MB RAM
- **AWS Bedrock**: Claude 3.5 Haiku, inference profile us.anthropic.claude-3-5-haiku-20241022-v1:0; DPA available; data not used for training; ~$0.25/M input tokens
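Because litellm-proxy exposes an OpenAI-compatible endpoint on :4000, any OpenAI-style client can target it directly. A minimal sketch using only the Python standard library; the model alias `claude-haiku`, the bearer token, and localhost are placeholders, not values from the deployed config:

```python
import json
import urllib.request

def chat_request(question: str, base: str = "http://localhost:4000") -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for the LiteLLM proxy.

    LiteLLM forwards this to Bedrock (Claude 3.5 Haiku) and, with
    drop_params enabled, strips parameters Bedrock would reject.
    """
    payload = {
        "model": "claude-haiku",  # placeholder alias; must match litellm_config.yaml
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-placeholder",  # placeholder key
        },
    )

req = chat_request("Summarise the indexed documents.")
# urllib.request.urlopen(req) would send it; omitted here.
```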
When a user asks a question, this is the exact sequence of operations.
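That sequence can be sketched end-to-end with toy stand-ins. Every function below is illustrative (toy embeddings, squared-distance ranking), not OpenRAG's actual code, but the stage order matches the deployed components: embed the query (vllm-cpu, BGE-small), search Milvus, optionally rerank, then call the LLM via litellm-proxy:

```python
# Toy stand-ins for the deployed stages; names and math are illustrative only.

def embed(text: str) -> list[float]:
    # Stand-in for BAAI/bge-small-en-v1.5 (real output: a 384-dim vector).
    return [float(ord(c) % 7) for c in text[:8].ljust(8)]

def search(query_vec: list[float], corpus: list[dict], top_k: int = 2) -> list[dict]:
    # Stand-in for a Milvus similarity search: rank by squared distance.
    def dist(doc):
        return sum((a - b) ** 2 for a, b in zip(query_vec, doc["vec"]))
    return sorted(corpus, key=dist)[:top_k]

def answer(question: str, corpus: list[dict]) -> str:
    hits = search(embed(question), corpus)        # Milvus lookup
    # reranker-cpu would re-score `hits` here; it is stopped by default.
    context = "\n".join(d["text"] for d in hits)
    # The real pipeline now POSTs question + context to litellm-proxy (:4000),
    # which forwards it to Bedrock Claude 3.5 Haiku; stubbed as a string here.
    return f"[LLM answer grounded in:]\n{context}"

docs = ["Milvus stores embeddings", "MinIO holds objects"]
corpus = [{"text": t, "vec": embed(t)} for t in docs]
print(answer("where are embeddings stored", corpus))
```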
Nine containers in total. The reranker is stopped to save 2.1 GB of RAM; enable it for corpora of 50+ documents.
| Container | Image | RAM | Port | Status |
|---|---|---|---|---|
| openrag-cpu | linagoraai/openrag | 4.7 GB | :8080 | Running |
| litellm-proxy | ghcr.io/berriai/litellm | 800 MB | :4000 | Running |
| vllm-cpu | vllm (BGE-small) | 1.1 GB | :8000 | Healthy |
| milvus | milvusdb/milvus | 190 MB | :19530 | Healthy |
| minio | minio/minio | 116 MB | :9000 | Running |
| etcd | quay.io/coreos/etcd | 47 MB | :2379 | Running |
| rdb | postgres | 43 MB | :5432 | Running |
| indexer-ui | linagoraai/indexer-ui | 31 MB | :8060 | Running |
| reranker-cpu | gte-reranker-base | 2.1 GB | :7997 | Stopped |
Total idle RAM: ~6.4 GB (without reranker) / ~8.5 GB (with reranker) of 16 GB allocated.
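As a sanity check, the per-container figures in the table can be summed directly; note the straight sum comes out slightly above the quoted idle totals, which may reflect measurement at different times or shared pages between containers:

```python
# Per-container idle RAM from the table above, in GB.
ram_gb = {
    "openrag-cpu": 4.7,
    "litellm-proxy": 0.800,
    "vllm-cpu": 1.1,
    "milvus": 0.190,
    "minio": 0.116,
    "etcd": 0.047,
    "rdb": 0.043,
    "indexer-ui": 0.031,
}
reranker_gb = 2.1  # reranker-cpu, stopped by default

without_reranker = sum(ram_gb.values())
with_reranker = without_reranker + reranker_gb
print(f"~{without_reranker:.1f} GB without reranker, ~{with_reranker:.1f} GB with")
```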
- **Level 1**: single VM on Proxmox, VMware, Hyper-V, or a cloud VPS
- **Level 2**: 1–2 VMs with a reverse proxy, SSO, encrypted storage, and audit logging
- **Level 3**: Kubernetes cluster with HA, multi-tenant isolation, GPU inference, and observability
| Component | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Embedder | BGE-small-en, 33M params, 130 MB | BGE-base-en, 110M params, 400 MB | BGE-large / multilingual, GPU-accelerated |
| LLM | Cloud: Bedrock Haiku, or Ollama (Llama 3.1 8B) | Cloud: Bedrock + DPA, or local Mistral 7B | Fully local: Llama 70B or Mixtral 8×7B (air-gapped) |
| Reranker | None (<50 documents) | BGE-reranker-base, ~500 MB | BGE-reranker-v2-m3, GPU-accelerated |
| Privacy | Embeddings local; LLM via cloud with DPA | Everything local possible; cloud fallback with DPA | Fully air-gapped; zero cloud dependency |
Source of truth on the OpenRAG VM: /opt/openrag/quick_start/.env
| File | Path on VM | Purpose |
|---|---|---|
| Main .env | /opt/openrag/quick_start/.env | All environment variables |
| Docker Compose | /opt/openrag/quick_start/docker-compose.yaml | Container definitions + volumes |
| Pipeline patch | /opt/openrag/pipeline_patched.py | Removes vLLM-specific params (Bedrock compatibility) |
| LiteLLM config | /opt/openrag/litellm_config.yaml | Bedrock model routing |
| Hydra config | /opt/openrag/.hydra_config/config.yaml | OpenRAG internal settings |
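The Bedrock routing in litellm_config.yaml follows LiteLLM's standard proxy schema. A minimal sketch of what such a file looks like; the model alias and comments are assumptions, not a copy of the deployed file:

```yaml
model_list:
  - model_name: claude-haiku            # alias exposed on the OpenAI-compatible API
    litellm_params:
      model: bedrock/us.anthropic.claude-3-5-haiku-20241022-v1:0
      # AWS credentials and region are read from the environment (.env)

litellm_settings:
  drop_params: true   # config-file equivalent of the --drop_params flag
```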