Introduction

Sirene is a multi-backend text-to-speech router with a web interface. It provides a single entry point for generating speech through multiple TTS backends, and it manages custom voices built from audio samples.

Tech Stack

| Layer | Technology |
| --- | --- |
| JS Runtime | Bun |
| Monorepo | Turborepo |
| Server | Hono |
| Inference | FastAPI + ONNX Runtime / PyTorch |
| DB / Realtime / Files | PocketBase |
| Frontend | React 19 + Vite + TanStack Router |
| UI | Tailwind CSS + Radix UI |
| State | TanStack Query + PocketBase SSE |
| Linting | Biome |

Architecture

Client (React)
  |
Nginx (reverse proxy in production)
  |-- /api  --> Hono Server (Bun, port 3000)
  |               |-- PocketBase (port 8090)
  |               '-- Inference FastAPI (port 8000)
  |-- /db   --> PocketBase
  '-- /     --> React SPA
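
The prefix routing in the diagram can be sketched as a small pure function. This is only an illustration of the routing rules, not the actual Nginx configuration: the `Upstream` type, the `resolveUpstream` function, and the example paths are hypothetical, and the sketch assumes a prefix only matches on a path-segment boundary.

```typescript
// Hypothetical names for illustration; none of these come from the Sirene codebase.
type Upstream = "hono" | "pocketbase" | "spa";

// Mirror the diagram: /api goes to the Hono server, /db goes to PocketBase,
// and every other path falls through to the React SPA.
function resolveUpstream(path: string): Upstream {
  // Assumption: "/api" matches "/api" and "/api/...", but not "/apix".
  const matches = (prefix: string): boolean =>
    path === prefix || path.startsWith(prefix + "/");

  if (matches("/api")) return "hono";
  if (matches("/db")) return "pocketbase";
  return "spa";
}

console.log(resolveUpstream("/api/generate")); // "hono"
console.log(resolveUpstream("/db/api/health")); // "pocketbase"
console.log(resolveUpstream("/voices")); // "spa"
```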

Responsibilities

Client (React) — User interface. Connects to PocketBase via SSE for real-time updates (download progress, generation status). Communicates with the Hono server for actions.
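
As a rough sketch of how the client might fold real-time events into local state: PocketBase realtime events carry an `action` (`create`, `update`, or `delete`) and the affected `record`. The `GenerationRecord` fields and the `applyEvent` helper below are assumptions made for illustration, not Sirene's actual schema or code.

```typescript
// Shape of a PocketBase realtime event: an action plus the affected record.
interface RealtimeEvent<T> {
  action: "create" | "update" | "delete";
  record: T;
}

// Hypothetical record shape; the real collection fields may differ.
interface GenerationRecord {
  id: string;
  status: "pending" | "running" | "done" | "error";
  progress: number; // assumed to range from 0 to 1
}

type GenerationState = Record<string, GenerationRecord>;

// Pure reducer: returns a new state object and never mutates the input.
function applyEvent(
  state: GenerationState,
  event: RealtimeEvent<GenerationRecord>,
): GenerationState {
  const next: GenerationState = { ...state };
  if (event.action === "delete") {
    delete next[event.record.id];
  } else {
    // "create" and "update" both store the latest version of the record.
    next[event.record.id] = event.record;
  }
  return next;
}
```

With the PocketBase JS SDK, a reducer like this would typically be called from the callback passed to `pb.collection(name).subscribe("*", callback)`, where the collection name depends on the actual schema.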

Server (Hono) — Pure orchestrator, zero inference. Receives requests, validates, resolves voices/parameters, forwards to the Python service for inference, and writes results to PocketBase.
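
The validate and resolve steps might look roughly like the sketch below. It is written as a plain function so it runs without Hono, PocketBase, or a network. Every name in it (`GenerateBody`, `Voice`, `InferenceRequest`, `prepareInferenceRequest`, the `speed` parameter and its default) is hypothetical and is not taken from the Sirene source.

```typescript
// Hypothetical request body as received by the orchestrator.
interface GenerateBody {
  text?: unknown;
  voiceId?: unknown;
  speed?: unknown;
}

// Hypothetical voice record, as it might be stored in PocketBase.
interface Voice {
  id: string;
  backend: string;
  samplePath: string | null;
}

// Hypothetical payload forwarded to the Python inference service.
interface InferenceRequest {
  backend: string;
  text: string;
  speed: number;
  samplePath: string | null;
}

type PrepareResult =
  | { ok: true; request: InferenceRequest }
  | { ok: false; error: string };

function prepareInferenceRequest(
  body: GenerateBody,
  voices: Map<string, Voice>,
): PrepareResult {
  // 1. Validate the incoming request.
  if (typeof body.text !== "string" || body.text.trim() === "") {
    return { ok: false, error: "text is required" };
  }
  if (typeof body.voiceId !== "string") {
    return { ok: false, error: "voiceId is required" };
  }

  // 2. Resolve the voice to find out which backend should handle it.
  const voice = voices.get(body.voiceId);
  if (voice === undefined) {
    return { ok: false, error: `unknown voice: ${body.voiceId}` };
  }

  // 3. Resolve parameters, falling back to an assumed default speed of 1.
  const speed = typeof body.speed === "number" ? body.speed : 1;

  // 4. Build the payload; no inference happens in the orchestrator itself.
  return {
    ok: true,
    request: {
      backend: voice.backend,
      text: body.text.trim(),
      speed,
      samplePath: voice.samplePath,
    },
  };
}
```

In a real handler, the returned `request` would then be sent to the inference service and the resulting audio written to PocketBase; both steps are omitted here because their endpoints and schemas are not described above.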

Inference (FastAPI) — All TTS inference. A single PyTorch runtime, one GPU, lazy-loading models, memory cache. Downloads models on demand into a Docker volume.
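
The lazy-loading and memory-cache pattern can be illustrated with a short sketch. The real inference service is Python, so this TypeScript version only shows the idea: the `LazyModelCache` class and its loader callback are hypothetical, and real model loading (downloads, GPU placement, eviction) is not modeled.

```typescript
// Illustrative only: load a model the first time it is requested,
// then serve every later request from an in-memory map.
class LazyModelCache<M> {
  private readonly loaded = new Map<string, M>();
  private readonly load: (name: string) => M;

  constructor(load: (name: string) => M) {
    this.load = load;
  }

  get(name: string): M {
    const cached = this.loaded.get(name);
    if (cached !== undefined) {
      return cached; // cache hit: no reload
    }
    const model = this.load(name); // cache miss: load on demand
    this.loaded.set(name, model);
    return model;
  }

  isLoaded(name: string): boolean {
    return this.loaded.has(name);
  }
}

// Usage with a stand-in loader that counts how often it runs.
let loadCount = 0;
const cache = new LazyModelCache<{ name: string }>((name) => {
  loadCount += 1;
  return { name };
});

cache.get("kokoro");
cache.get("kokoro");
console.log(loadCount); // 1
```

Nothing is loaded at startup, so memory is only spent on backends that are actually used.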

PocketBase — SQLite database, file storage (audio samples, generations), real-time SSE subscriptions, admin UI for debugging.

Monorepo Structure

sirene/
├── client/          # React + Vite + Tailwind
├── server/          # Hono (orchestrator, zero inference)
├── inference/       # FastAPI + TTS backends (Kokoro, Qwen, F5, Piper)
├── shared/          # Shared TypeScript types
├── db/              # PocketBase (migrations, data)
├── docs/            # VitePress documentation
├── docker/          # Dockerfile, nginx, supervisord, entrypoint
└── data/models/     # Downloaded models (gitignored)

Supported Backends

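| Backend | Voice Cloning | Streaming | Languages |
| --- | --- | --- | --- |
| Kokoro | | | EN, FR, JA, KO, ZH |
| Qwen3-TTS | Yes | | 10+ languages |
| F5-TTS | Yes | Yes | Multilingual |
| Piper | | | 26 languages, 40+ voices |
| CosyVoice | Yes | Yes (~150ms) | 9 languages |
| OpenAudio S1 | Yes | | Multilingual |
| Chatterbox | Yes | | EN + 23 languages |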