Docker
Sirene is split into two Docker images:
ghcr.io/kevinbonnoron/sirene— Nginx + React client + Hono API + PocketBaseghcr.io/kevinbonnoron/sirene-inference— Python inference server (FastAPI + uvicorn)
All model management (download, install check, deletion) is handled by the inference server. The server container only needs persistent storage for the PocketBase database.
Quick Install
The install script sets up everything automatically:
curl -sSL https://raw.githubusercontent.com/KevinBonnoron/sirene/main/install.sh | bashIt will:
- Check for Docker and Docker Compose
- Ask for your deployment mode (see below)
- Generate PocketBase admin credentials
- Create a
docker-compose.ymland.envfile - Pull and start the containers
Deployment Modes
Local CPU
Everything runs on the same machine. Backend dependencies (torch, onnxruntime, etc.) are downloaded on demand when you install a model.
services:
server:
image: ghcr.io/kevinbonnoron/sirene:latest
ports:
- "80:80"
volumes:
- sirene-data:/app/db/pb_data
env_file:
- .env
restart: unless-stopped
inference:
image: ghcr.io/kevinbonnoron/sirene-inference:latest
volumes:
- sirene-models:/app/data/models
- sirene-packages:/app/data/packages
restart: unless-stopped
volumes:
sirene-data:
sirene-models:
sirene-packages:Local CUDA (NVIDIA GPU)
Same as above but the inference service uses the CUDA image and requires the NVIDIA Container Toolkit.
inference:
image: ghcr.io/kevinbonnoron/sirene-inference:cuda
volumes:
- sirene-models:/app/data/models
- sirene-packages:/app/data/packages
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]Remote Inference (RunPod)
Run the server locally (or on a cheap VPS) and offload inference to a RunPod GPU pod. This avoids needing a local GPU — model files and Python dependencies live entirely on the pod.
Why a Pod and not Serverless? Sirene's inference service is a long-running HTTP server (FastAPI). RunPod Serverless requires a custom handler format and would cold-start on every request, which is too slow for loading TTS models into GPU memory. A GPU Pod keeps the service running and ready.
1. Create a GPU Pod on RunPod
- Go to runpod.io/console/pods and click + GPU Pod
- Pick a GPU (RTX 3090, RTX 4090, A40, etc. — 16 GB+ VRAM recommended)
- Under Container Image, enter:
ghcr.io/kevinbonnoron/sirene-inference:cuda - Under Expose HTTP Ports, add:
8000 - Under Environment Variables, add:
INFERENCE_DEVICE=cuda
- (Recommended) Attach a Network Volume mounted at
/app/datato persist downloaded models and packages across pod restarts - Click Deploy
Once the pod is running, RunPod gives you a proxy URL. Find it in the pod's Connect tab — it looks like:
https://{pod-id}-8000.proxy.runpod.netYou can verify the inference service is ready:
curl https://{pod-id}-8000.proxy.runpod.net/health2. Deploy the server locally
On your machine or VPS, run only the server container and point INFERENCE_URL to your RunPod pod:
services:
server:
image: ghcr.io/kevinbonnoron/sirene:latest
ports:
- "80:80"
volumes:
- sirene-data:/app/db/pb_data
environment:
- INFERENCE_URL=https://{pod-id}-8000.proxy.runpod.net
env_file:
- .env
restart: unless-stopped
volumes:
sirene-data:Replace {pod-id} with your actual pod ID from RunPod.
Tips
- Cost — stop the pod from the RunPod dashboard when you're not using it. Models stored on a Network Volume will still be there when you restart.
- Spot pods — cheaper but can be interrupted. Fine for non-critical usage.
- Latency — audio generation involves large responses. Pick a RunPod region close to your server for best performance.
Environment Variables
Server (sirene)
| Variable | Default | Description |
|---|---|---|
PB_SUPERUSER_EMAIL | — | PocketBase admin email (created on first start) |
PB_SUPERUSER_PASSWORD | — | PocketBase admin password |
VITE_PB_URL | /db | PocketBase URL as seen by the browser |
VITE_SERVER_URL | /api | API server URL as seen by the browser |
INFERENCE_URL | http://inference:8000 | URL of the inference service |
Inference (sirene-inference)
| Variable | Default | Description |
|---|---|---|
INFERENCE_DEVICE | cpu | cpu or cuda |
INFERENCE_MODELS_PATH | /app/data/models | Path to model files |
SIRENE_PACKAGES_DIR | /app/data/packages | Persistent dir for lazily installed backend deps |
Volumes
| Volume | Service | Container Path | Description |
|---|---|---|---|
sirene-data | server | /app/db/pb_data | PocketBase database and uploaded files |
sirene-models | inference | /app/data/models | Downloaded TTS models |
sirene-packages | inference | /app/data/packages | Lazily installed Python backend packages |
Build from Source
# Server image
docker build -f docker/Dockerfile -t sirene .
# Inference image (CPU)
docker build -f docker/Dockerfile.inference -t sirene-inference .
# Inference image (CUDA)
docker build -f docker/Dockerfile.inference -t sirene-inference:cuda --build-arg INFERENCE_VARIANT=cuda .Architecture
Nginx routes all traffic in the server container:
/api→ Hono server (port 3000)/db→ PocketBase (port 8090)/→ React SPA (static files)
The Hono server delegates all model operations (download, install check, deletion) to the inference container via its REST API at INFERENCE_URL. Backend Python dependencies (torch, onnxruntime, etc.) are not bundled in the image — they are installed on demand into the sirene-packages volume the first time a model using that backend is installed.