Homelab

I operate three Linux nodes connected over a private Tailscale mesh. The cluster runs public applications, a MongoDB replica set, local AI inference, automation, monitoring, logging, backups, and security services.

Every application runs in Docker with a Tailscale sidecar and no public host ports. Cloudflare Tunnels route public traffic, while Prometheus, Grafana, Loki, and exporters make the systems observable.

The local inference service runs Qwen3.5-35B-A3B through llama.cpp on a down-tuned RTX 3060 and reaches approximately 101 generated tokens per second. The sparse MoE architecture keeps active compute manageable, while model movement over PCIe is the primary bottleneck.

The network currently includes 25 Tailscale devices: 3 nodes, 2 laptops, and 20 services. Ansible playbooks handle repeatable deployments, health checks, backups, and maintenance across the nodes.

This page documents the hardware, replicated database, local AI, observability stack, automation, container architecture, and security layers.

Tech Stack

Network: Tailscale mesh VPN, UFW firewall, Cloudflare Tunnels
Containers: Docker, Docker Compose
Orchestration: Ansible playbooks for deployment and health checks
Database: MongoDB 8.0 replica set (primary, secondary, arbiter)
Monitoring: Prometheus, Grafana (7 dashboards), Node Exporter, cAdvisor, DCGM
Logging: Loki, Promtail (centralized log aggregation)
Local inference: Qwen3.5-35B-A3B via llama.cpp on an RTX 3060 (approximately 101 tokens/sec)
AI experiments: document retrieval, Qdrant, and cross-encoder reranking; not active services
Security: Fail2Ban, HashiCorp Vault, SSH hardening
Notifications: Slack webhooks, Cloudflare Email Workers
Web Frameworks: Next.js, FastAPI, Node.js, nginx
Mobile: React Native, Expo, iOS App Store, Google Play Store

Nodes

Primary Node (Prometheus)

CPU: AMD Ryzen 9 5950X (16 cores / 32 threads)
RAM: 32GB DDR4
GPU 0: NVIDIA RTX 5060 Ti (16GB VRAM, 448GB/s bandwidth)
GPU 1: NVIDIA RTX 3060 (12GB VRAM, 360GB/s bandwidth)
Storage: 500GB NVMe SSD
OS: Ubuntu 24.04.3
Network: Ethernet — 668.91 Mbps down / 728.07 Mbps up
Location: Southern Florida

Beelink (Mini PC)

CPU: Intel N100 (4 cores / 4 threads)
RAM: 16GB DDR4
Storage: 500GB SSD
OS: Ubuntu 24.04.3
Network: Ethernet — 92.1 Mbps down / 17.6 Mbps up
Location: Argentina

Raspberry Pi 5

CPU: ARM Cortex-A76 (4 cores / 4 threads)
RAM: 8GB
Storage: 1TB Samsung T7 SSD (USB 3.2 Gen 2, 1,050MB/s)
OS: Debian GNU/Linux 12 (Bookworm)
Boot: External SSD, no SD card
Network: Ethernet — 105.75 Mbps down / 19.91 Mbps up
Location: Argentina

Database

A MongoDB replica set spans all three nodes. The primary node handles writes, the Beelink is a secondary with a full data copy, and the Pi runs as an arbiter that only votes in elections.
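The role split matters because an election needs a strict majority of voting members. Here is a minimal sketch of that arithmetic, assuming each of the three members holds one vote. The member labels are illustrative, not real hostnames.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voting_members // 2 + 1


def can_elect_primary(reachable: set[str], members: set[str]) -> bool:
    """True if the reachable voting members can still form a majority."""
    return len(reachable & members) >= majority(len(members))


# Illustrative labels for the three voting members described above.
MEMBERS = {"primary", "beelink", "pi-arbiter"}

# Primary down: the Beelink and the arbiter still hold 2 of 3 votes.
print(can_elect_primary({"beelink", "pi-arbiter"}, MEMBERS))  # True
# The Beelink alone cannot win an election.
print(can_elect_primary({"beelink"}, MEMBERS))  # False
```

This is why the arbiter is useful even though it stores no data: it supplies the second vote when one data-bearing member is unreachable.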

Replica Set: rs0
Version: MongoDB 8.0.16
Failover: Automatic. If the primary goes down, the Beelink is elected primary within seconds.
Auth: keyFile authentication, 3-tier users (admin, app, monitor)
Replication monitoring: replica health, oplog position, and lag tracked across the Tailscale network
Backups: Automated mongodump to Pi, compressed .gz archives, 7-day retention
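The 7-day retention rule can be sketched as a pure function over archive timestamps. The archive names below are hypothetical, since the real naming scheme is not shown here, and a real job would read timestamps from the filesystem instead of a dictionary.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # matches the 7-day retention above


def expired(archives: dict[str, datetime], now: datetime) -> list[str]:
    """Return archive names whose timestamp is older than the retention window."""
    cutoff = now - RETENTION
    return sorted(name for name, taken_at in archives.items() if taken_at < cutoff)


now = datetime(2025, 1, 10, 3, 0)
# Hypothetical archive names and timestamps, for illustration only.
archives = {
    "rs0-2025-01-01.gz": datetime(2025, 1, 1, 3, 0),
    "rs0-2025-01-03.gz": datetime(2025, 1, 3, 3, 0),
    "rs0-2025-01-09.gz": datetime(2025, 1, 9, 3, 0),
}
print(expired(archives, now))  # ['rs0-2025-01-01.gz']
```

An archive taken exactly seven days ago sits on the cutoff and is kept; only strictly older archives are selected for deletion.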

Performance tuning I did: kernel swappiness set to 1, Transparent Huge Pages enabled with defrag set to defer+madvise, and the tcmalloc-google allocator. All MongoDB startup warnings eliminated.
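A quick way to confirm the kernel settings on a node is to read them back from procfs and sysfs. This is a sketch, assuming the defer+madvise value refers to the THP defrag file. The `root` parameter is only there to make the reader testable against a fake directory tree.

```python
from pathlib import Path


def active_choice(sysfs_text: str) -> str:
    """Return the bracketed (active) value from a sysfs multiple-choice file."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active value in {sysfs_text!r}")


def read_tuning(root: Path = Path("/")) -> dict[str, str]:
    """Read swappiness and the THP settings from a Linux host."""
    thp = root / "sys/kernel/mm/transparent_hugepage"
    return {
        "swappiness": (root / "proc/sys/vm/swappiness").read_text().strip(),
        "thp_enabled": active_choice((thp / "enabled").read_text()),
        "thp_defrag": active_choice((thp / "defrag").read_text()),
    }


# The kernel marks the active option with square brackets.
print(active_choice("always defer [defer+madvise] madvise never"))  # defer+madvise
```

Calling `read_tuning()` on a node returns a small dictionary that can be compared against the intended values in a health check.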

Monitoring

Prometheus scrapes metrics every 15s from all nodes. 30-day retention. Grafana runs on the Pi for dashboards.
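The scrape interval and retention together fix how many samples each series keeps. The arithmetic below is a rough sizing sketch. The 2 bytes per sample figure is an assumed average and the series count in the example is hypothetical, so treat the result as an order of magnitude, not a measurement.

```python
SCRAPE_INTERVAL_S = 15   # from the scrape interval above
RETENTION_DAYS = 30      # from the retention above

# Samples retained per time series over the full retention window.
samples_per_series = RETENTION_DAYS * 24 * 60 * 60 // SCRAPE_INTERVAL_S
print(samples_per_series)  # 172800


def estimated_bytes(series: int, bytes_per_sample: float = 2.0) -> float:
    """Rough on-disk size; bytes_per_sample is an assumed average."""
    return series * samples_per_series * bytes_per_sample


# Hypothetical example: 10,000 active series.
print(estimated_bytes(10_000) / 1e9)  # 3.456 (GB)
```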

Grafana Data Sources:

Prometheus (metrics)
Loki (logs)

What I'm monitoring:

Node Exporter on all nodes (CPU, RAM, disk, network)
cAdvisor v0.51.0 on all nodes (container metrics)
MongoDB Exporter on all nodes (replica set health, connections, oplog)
NVIDIA DCGM Exporter (GPU temp, utilization, power draw, memory)
Promtail shipping logs from all nodes to Loki
Qwen inference server (latency, tokens/sec, GPU utilization)

cAdvisor optimization: Default config was eating CPU. Changed housekeeping interval to 10s, disabled collectors I don't need (tcp, udp, sched, process, hugetlb). 93-95% CPU reduction.

Logging

Centralized logging with Loki and Promtail. Loki runs on the primary node, and Promtail agents on every node ship logs to it.

What gets collected:

Docker container logs (auto-discovered)
Systemd journal logs
Fail2Ban events (security)
Tailscale connectivity logs
GPU and DCGM logs

Retention: 30 days with automatic compaction every 10 minutes. 100MB query cache.

All logs are labeled by host, node type, container name, and Compose service, which makes filtering in Grafana easy.
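Those labels become the stream selector in a LogQL query. The helper below is a small sketch that builds an exact-match selector. The label names `host` and `compose_service` are assumptions about how the labels are spelled, not the actual Promtail configuration.

```python
def logql_selector(**labels: str) -> str:
    """Build a LogQL stream selector from exact-match label pairs."""
    if not labels:
        raise ValueError("a selector needs at least one label matcher")
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside the quoted value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


# Assumed label names, for illustration only.
print(logql_selector(host="pi", compose_service="grafana"))
# {compose_service="grafana", host="pi"}
```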

Local AI inference

Qwen3.5-35B-A3B runs locally through llama.cpp with no per-request API cost. The current setup is tuned around a sparse Mixture-of-Experts workload and the memory limits of an RTX 3060.

Approximately 101 generated tokens per second
Down-tuned GPU voltage/power profile
PCIe model movement is the main measured bottleneck
FastAPI server with server-sent-event streaming
Prometheus metrics for inference latency and GPU behavior
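Server-sent events are plain text frames, so the streaming part can be sketched without any framework. This is an illustration of the wire format only. The `[DONE]` marker is an assumed convention, not part of the SSE format and not confirmed for this service.

```python
from typing import Iterable, Iterator


def sse_event(data: str) -> str:
    """Frame one payload as a server-sent event (one data: line per payload line)."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per generated token, then a final marker frame."""
    for token in tokens:
        yield sse_event(token)
    yield sse_event("[DONE]")  # assumed end-of-stream convention


for frame in stream_tokens(["Hello", " world"]):
    print(repr(frame))
```

In FastAPI, a generator like `stream_tokens` can be handed to `StreamingResponse` with `media_type="text/event-stream"`.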

The useful engineering work is not only loading the model. It is measuring where time and memory move across the system, then tuning the runtime and hardware profile around the real bottleneck.
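A repeatable measurement starts with a clear definition of the number being reported. The sketch below shows one possible definition of generated tokens per second: it starts timing at the first token so prompt processing is excluded. That choice is an assumption, not necessarily how the figure above was measured. The clock is injectable so the function can be tested with fixed timestamps.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return generated tokens per second.

    Timing starts at the first token, so time to first token is excluded.
    """
    count = 0
    start = end = None
    for _ in tokens:
        now = clock()
        if start is None:
            start = now
        end = now
        count += 1
    if count < 2 or end == start:
        raise ValueError("need at least two tokens with distinct timestamps")
    # count - 1 intervals elapsed between the first and last token.
    return (count - 1) / (end - start)
```

Passing the token iterator from a streaming client into `tokens_per_second` gives a decode-only rate that can be compared across runtime and hardware settings.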

Retrieval experiments

I previously experimented with document chunking, embeddings, Qdrant retrieval, and cross-encoder reranking. Those experiments helped me understand the components of a RAG pipeline, but they are not part of the currently running inference service.

Tool use

The service exposes bounded infrastructure diagnostics through FastAPI. Retrieval and Qdrant are not required for the current tool workflow.
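One way to read "bounded" is an allowlist: the caller picks a diagnostic by name and never supplies a command. The sketch below illustrates that idea under that assumption. The tool names, timeout, and output limit are hypothetical, and it is not the service's actual implementation.

```python
import subprocess

# Hypothetical allowlist: tool name -> fixed argv. Nothing from the caller is executed.
DIAGNOSTICS: dict[str, list[str]] = {
    "disk_usage": ["df", "-h", "/"],
    "containers": ["docker", "ps"],
}

MAX_OUTPUT_CHARS = 4000  # assumed cap on returned output
TIMEOUT_S = 10           # assumed per-command timeout


def run_diagnostic(name: str) -> str:
    """Run one allowlisted diagnostic with a timeout and truncated output."""
    argv = DIAGNOSTICS.get(name)
    if argv is None:
        raise KeyError(f"unknown diagnostic: {name}")
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=TIMEOUT_S, check=False
    )
    return result.stdout[:MAX_OUTPUT_CHARS]
```

Because the argv lists are fixed and no shell is involved, a model-generated tool call can only choose among known read-only checks.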

Ansible Automation

All deployment and health checks run through Ansible. SSH key auth, no passwords.

Playbooks I use:

| Playbook | What it does |
|----------|--------------|
| playbook-monitoring.yml | Deploy Prometheus, exporters, Grafana |
| playbook-logging.yml | Deploy Loki and Promtail agents |
| 01-playbook-network.yml | Ping tests, speedtest across nodes |
| 02-playbook-health.yml | CPU, RAM, disk checks |
| 03-playbook-db-replica-set.yml | Replica set status, replication lag |
| 04-playbook-backups.yml | mongodump with compression and retention |

Quick commands I run often:

ansible all -m ping
ansible all -m shell -a "df -h /"
ansible all -m shell -a "docker ps"

Docker Architecture

Every service runs in its own network namespace via the Tailscale sidecar pattern. No shared Docker networks, no exposed ports on the host.

Sidecar Pattern:

┌─────────────────────────────────────┐
│  Tailscale Container (ts-fastapi)   │  ← Gets its own Tailscale IP
│  network_mode: bridge               │
└─────────────────────────────────────┘
              ▲
              │ shares network namespace
              ▼
┌─────────────────────────────────────┐
│  App Container (fastapi)            │  ← Uses sidecar's network
│  network_mode: service:ts-fastapi   │
└─────────────────────────────────────┘

Each app appears as its own device on Tailscale. Jugamos alone has 3 Tailscale nodes: nextjs, fastapi, nginx.
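The sidecar rules are simple enough to check mechanically. The sketch below lints a parsed Compose `services` mapping, assuming sidecars follow the `ts-` naming shown in the diagram. The function and the example services are illustrative, not part of the actual deployment tooling.

```python
def sidecar_violations(services: dict[str, dict]) -> list[str]:
    """Check a parsed Compose `services` mapping against the sidecar rules."""
    problems = []
    for name, spec in services.items():
        if spec.get("ports"):
            problems.append(f"{name}: publishes host ports")
        if name.startswith("ts-"):
            continue  # sidecars own the network namespace
        mode = spec.get("network_mode", "")
        target = mode[len("service:"):] if mode.startswith("service:") else None
        if target is None or not target.startswith("ts-") or target not in services:
            problems.append(f"{name}: network_mode must point at a sidecar service")
    return problems


# Mirrors the diagram above: one sidecar, one app sharing its namespace.
GOOD = {
    "ts-fastapi": {"image": "tailscale/tailscale", "network_mode": "bridge"},
    "fastapi": {"build": ".", "network_mode": "service:ts-fastapi"},
}
print(sidecar_violations(GOOD))  # []
print(sidecar_violations({"web": {"ports": ["8080:80"]}}))
```

A check like this could run before deployment to catch a service that publishes a host port or skips its sidecar.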

Container Security:

Non-root users in all Dockerfiles (adduser, USER fastapi)
Multi-stage builds—no build tools in production images
Alpine base images for minimal attack surface
No privileged mode (except cAdvisor for metrics)
Runtime secrets are passed through environment variables and never baked into images

Build Example (FastAPI):

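# Production stage of the multi-stage build
FROM python:3.12-alpine AS production
# Create an unprivileged user and group with fixed IDs
RUN addgroup -g 1001 fastapi \
    && adduser -u 1001 -G fastapi -s /bin/sh -D fastapi
# Drop root for every instruction and process that follows
USER fastapi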

Running Applications

Three public applications currently run with dedicated Tailscale sidecars. CleanFuture is being prepared for its production return.

YelpCamp - Camp review site. Node.js, Cloudinary for images, Mapbox for maps, MongoDB backend.

Portfolio - This site. Next.js with Google Analytics. Exposed via Cloudflare Tunnel.

Jugamos - Game platform. Next.js frontend, FastAPI backend, nginx reverse proxy. JWT auth, avatar uploads. Three separate Tailscale nodes for isolation.

CleanFuture.io - Next.js and FastAPI renewable-energy planning product currently being simplified for a production relaunch.

Security

Layered security across all nodes. Two main tools: Fail2Ban for intrusion prevention, HashiCorp Vault for secrets management.

Fail2Ban:

Running on all 3 nodes, monitors SSH
3 failed attempts = 1 hour IP ban
Ban events shipped to Loki, viewable in Grafana
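The ban policy can be modeled as a sliding-window counter. This is a simplified model of the behavior, not Fail2Ban's own code. The retry count and ban time come from the list above, while the 10-minute counting window is an assumed value.

```python
from collections import defaultdict, deque

MAX_RETRY = 3        # failed attempts before a ban (from the list above)
BAN_SECONDS = 3600   # 1 hour ban (from the list above)
FIND_SECONDS = 600   # counting window; assumed, not stated above


class BanTracker:
    def __init__(self) -> None:
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned_until = {}              # ip -> time the ban expires

    def is_banned(self, ip: str, now: float) -> bool:
        return self.banned_until.get(ip, 0.0) > now

    def record_failure(self, ip: str, now: float) -> bool:
        """Record a failed login; return True if this failure triggers a ban."""
        window = self.failures[ip]
        window.append(now)
        # Drop failures that fell out of the counting window.
        while window and window[0] <= now - FIND_SECONDS:
            window.popleft()
        if len(window) >= MAX_RETRY:
            self.banned_until[ip] = now + BAN_SECONDS
            window.clear()
            return True
        return False


tracker = BanTracker()
for t in (0.0, 10.0, 20.0):
    banned = tracker.record_failure("203.0.113.7", t)  # documentation-range IP
print(banned)  # True
```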

HashiCorp Vault:

Centralized secrets management
API tokens, database credentials, Tailscale keys
Nothing hardcoded in repos

SSH Hardening:

Key-only auth (ED25519), no passwords
Root login disabled
AllowUsers whitelist
X11 forwarding off
Idle timeout kicks inactive sessions

Firewall:

UFW default deny on all nodes
Only Tailscale subnet (100.64.0.0/10) allowed
Zero public ports exposed
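The allowed range is the CGNAT block 100.64.0.0/10, which covers 100.64.0.0 through 100.127.255.255. A short sketch of the same membership test with Python's standard `ipaddress` module; the helper name is illustrative:

```python
import ipaddress

TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def from_tailnet(source_ip: str) -> bool:
    """True if the address falls inside the range the firewall allows."""
    return ipaddress.ip_address(source_ip) in TAILSCALE_CGNAT


print(from_tailnet("100.101.102.103"))  # True
print(from_tailnet("100.128.0.1"))      # False, just past the end of the /10
print(from_tailnet("192.168.1.10"))     # False
```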

Database:

keyFile auth between replica members
Separate users for admin, app, monitoring
Credentials pulled from Vault, set as env vars
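On the application side, the environment variables can be turned into a replica set connection string. This is a sketch under stated assumptions: the variable names `MONGO_APP_USER` and `MONGO_APP_PASSWORD`, the `authSource=admin` option, and the host names in the usage note are hypothetical. Only the replica set name `rs0` comes from this page.

```python
import os
from urllib.parse import quote_plus


def mongo_uri(hosts: list[str], replica_set: str = "rs0") -> str:
    """Build a replica set URI from credentials held in environment variables."""
    # Hypothetical variable names; percent-encode so special characters are safe.
    user = quote_plus(os.environ["MONGO_APP_USER"])
    password = quote_plus(os.environ["MONGO_APP_PASSWORD"])
    seeds = ",".join(hosts)
    return (
        f"mongodb://{user}:{password}@{seeds}/"
        f"?replicaSet={replica_set}&authSource=admin"  # authSource is an assumption
    )
```

For example, `mongo_uri(["node-a:27017", "node-b:27017"])` lists two placeholder data-bearing members as seeds. The driver discovers the rest of the set from them.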

Containers:

Loki runs non-root with no-new-privileges
Tailscale sidecars isolate each service
Only cAdvisor runs privileged (required for metrics)

Notifications

Slack: Webhook integration for Grafana alerts, backup status, Fail2Ban triggers.
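A Slack incoming webhook accepts a JSON body with a `text` field, so a backup or ban notification needs very little code. The sketch below only builds the request, using the standard library. The helper name is illustrative and the webhook URL is supplied by the caller.

```python
import json
import urllib.request


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a call to `urllib.request.urlopen(request, timeout=10)`, with the webhook URL read from a secret at runtime, not stored in the repository.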

Email: Routed through Cloudflare Email Workers. No local SMTP server to maintain.

What's Next

Continue profiling Qwen MoE inference around PCIe transfers, memory use, power, and latency
Make inference benchmarks repeatable across runtime and hardware configurations
Expand Grafana alerting and bounded infrastructure automation
Add geographically separate small nodes when the workload justifies them
Keep application infrastructure proportional to real product traffic