TL;DR: LLM Gateway is an orchestration layer that exposes high-level API services (extraction, summarization, etc.) on top of LLM inference engines, managing context overflow, job tracking, and result consistency for production applications.
LLM Gateway is designed as an internal backend service, not a public-facing API:
- No built-in authentication - Deploy behind your application's existing auth layer
- No API key management - Your backend calls LLM Gateway directly
- Multi-tenant support - Use `organization_id` to isolate services and jobs between tenants
- Trust boundary - LLM Gateway trusts all incoming requests; secure at the network level
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Your App     │────▶│  Your Backend   │────▶│   LLM Gateway   │
│    (public)     │     │  (auth, authz)  │     │   (internal)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Pass `organization_id` when creating or executing services, and use it to filter jobs and results per tenant.
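For example, a tenant-scoped execution might look like this (a sketch; exactly where `organization_id` goes in the request may differ in your deployment):

```bash
# Sketch: execute a service on behalf of one tenant.
# Assumes organization_id is passed as a field in the JSON body.
curl -X POST "http://localhost:8000/api/v1/services/{service_id}/execute" \
  -H "Content-Type: application/json" \
  -d '{"input": "Text to process...", "organization_id": "tenant-a"}'
```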
Calling LLM APIs directly works for simple use cases, but production applications face real challenges:
- Context limits: Long documents exceed model context windows, causing failures
- No job management: Difficult to track progress, handle failures, or manage concurrent requests
- Cost unpredictability: Token usage and costs are unknown until after processing
- No result management: No caching, versioning, or structured output handling
LLM Gateway solves these problems with a service-oriented architecture that handles document chunking, job queuing with real-time updates, token tracking with cost estimation, and comprehensive result management including caching and versioning.
Each application implementing LLM features must handle its own complexity:
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Messaging App │ │ Document App │ │ Meeting App │ │
│ │ │ │ │ │ │ │
│ │ Feature: │ │ Feature: │ │ Feature: │ │
│ │ "Email Response"│ │ "Doc Formatting"│ │ "ASR Summary" │ │
│ │ │ │ │ │ │ │
│ │ - retry logic │ │ - retry logic │ │ - retry logic │ │
│ │ - job tracking │ │ - job tracking │ │ - job tracking │ │
│ │ - token count │ │ - token count │ │ - token count │ │
│ │ - long doc │ │ - long doc │ │ - long doc │ │
│ │ handling │ │ handling │ │ handling │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ │ DUPLICATED CODE │ DUPLICATED CODE │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ OpenAI / vLLM API │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Issues:
• Document too long → FAILURE
• API rate limit → Silent failure
• No visibility on running jobs
• Costs known only at end of month
• Every team reinvents the wheel
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Messaging App │ │ Document App │ │ Meeting App │ │
│ │ │ │ │ │ │ │
│ │ (business logic │ │ (business logic │ │ (business logic │ │
│ │ only) │ │ only) │ │ only) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ │ POST /api/v1/services/{id}/execute │ │
│ │ WS /ws/jobs/{id} │ │
│ └───────────────────────┼───────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ LLM GATEWAY │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Service │ │ Service │ │ Service │ │ │
│ │ │ "Email │ │ "Doc │ │ "Meeting │ ... │ │
│ │ │ Response" │ │ Formatting"│ │ Summary" │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ Common infrastructure: │ │
│ │ • Auto chunking (all languages) • Retry + backoff │ │
│ │ • Job queue (Celery) • Token/cost tracking │ │
│ │ • Real-time WebSocket • DOCX/PDF export │ │
│ └───────────────────────────┬───────────────────────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────┐ │
│ │ OpenAI / vLLM / Ollama API │ │
│ └───────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Result:
• Uniform integration: same API for all services
• Document too long → auto chunking, success
• Rate limit → transparent retry, success
• Real-time job monitoring via WebSocket
• Costs tracked per request
Read more: How LLM Gateway Works
Several open-source solutions exist for routing requests to multiple LLM providers:
- LiteLLM - Python SDK/proxy supporting 100+ providers with unified OpenAI format
- TensorZero - High-performance Rust gateway (<1ms latency)
- Portkey - AI gateway with caching, observability, guardrails
These are primarily proxies/routers - they forward requests to LLMs with added features like load balancing and retries. LLM Gateway goes further by handling document processing logic:
| Feature | LiteLLM / Portkey / etc. | LLM Gateway |
|---|---|---|
| Multi-provider routing | Yes | Yes |
| Cost tracking & caching | Yes | Yes |
| Iterative processing | No | Yes (chunking, rolling context) |
| Async job queue | Limited | Yes (Celery, priorities) |
| Long documents | Client-side handling | Gateway-side handling |
| Result versioning | No | Yes |
| DOCX/PDF export | No | Yes (with templates) |
| Dynamic placeholder extraction | No | Yes (LLM-based) |
| Output granularity control | No | Yes (batch size → detail level) |
```bash
git clone -b next https://github.com/linto-ai/llm-gateway.git
cd llm-gateway
docker compose up -d

# API:      http://localhost:8000/docs
# Frontend: http://localhost:8001
```

That's it! The default configuration works out-of-the-box. Database migrations and seed data (prompts, presets, document templates) are applied automatically on first start.

To customize settings, copy `.env.example` to `.env` and edit as needed.
For development with hot-reload:
```bash
cp .env.example .env   # Required for dev mode
docker compose -f docker-compose.dev.yml up --build
```

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Frontend   │────▶│   FastAPI   │────▶│   Celery    │
│  (Next.js)  │     │  (REST/WS)  │     │  (Workers)  │
└─────────────┘     └─────────────┘     └─────────────┘
                           │                   │
                    ┌──────┴──────┐     ┌──────┴──────┐
                    │ PostgreSQL  │     │    Redis    │
                    │ (Services,  │     │   (Tasks,   │
                    │  Jobs, etc) │     │   Broker)   │
                    └─────────────┘     └─────────────┘
```
The gateway supports two processing modes for handling documents of varying lengths:
| Mode | Description | Use Case |
|---|---|---|
| Single Pass | Process entire document in one LLM call | Short documents that fit in context window |
| Iterative | Process in batches with rolling context | Long documents (ASR transcriptions), progressive output |
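In practice the mode is a property of the flavor you execute with, so selecting a mode comes down to the `flavor_name` you pass; the flavor names below are illustrative:

```bash
# Short document: a single-pass flavor is sufficient (illustrative flavor name)
curl -X POST ".../services/{service_id}/execute" \
  -H "Content-Type: application/json" \
  -d '{"input": "Short note...", "flavor_name": "fast-single-pass"}'

# Long ASR transcript: use an iterative flavor (illustrative flavor name)
curl -X POST ".../services/{service_id}/execute" \
  -H "Content-Type: application/json" \
  -d '{"input": "Very long transcript...", "flavor_name": "iterative-fallback"}'
```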
When a document exceeds the context window of a single-pass flavor, the gateway can automatically fall back to an iterative flavor:
```json
{
  "fallback_applied": true,
  "original_flavor_name": "fast-single-pass",
  "flavor_name": "iterative-fallback",
  "fallback_reason": "Input (45000 tokens) exceeds context limit (32000 available)"
}
```

Configure fallback in flavor settings: `fallback_flavor_id` + `auto_fallback_to_iterative`.
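A sketch of the relevant part of a flavor's settings (field names taken from above; the surrounding structure and values are illustrative):

```json
{
  "name": "fast-single-pass",
  "auto_fallback_to_iterative": true,
  "fallback_flavor_id": "uuid-of-iterative-flavor"
}
```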
When a provider or model becomes unavailable (API errors, rate limits, downtime), the gateway can automatically switch to an alternative:
```json
{
  "failover_flavor_id": "uuid-of-backup-flavor",
  "failover_applied": true,
  "original_flavor_name": "gpt-4-primary",
  "flavor_name": "claude-backup",
  "failover_reason": "Provider unavailable: 503 Service Unavailable"
}
```

Configure failover chains to ensure service continuity even when primary providers fail.
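A chain can be sketched by pointing each flavor at the next backup; only `failover_flavor_id` comes from the example above, the rest is illustrative:

```json
[
  {"name": "gpt-4-primary", "failover_flavor_id": "uuid-of-claude-backup"},
  {"name": "claude-backup", "failover_flavor_id": "uuid-of-local-vllm-backup"}
]
```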
For long documents that exceed context limits, the gateway processes content in batches:
- Turn normalization (`create_new_turn_after`): Splits long turns into smaller chunks using universal sentence segmentation (works with all languages)
- Batch creation (`max_new_turns`): Groups turns into batches sent to the LLM with rolling context
What is a "turn"? A turn is a line or paragraph separated by a newline. In ASR context, turns can have a speaker prefix (`Speaker : text`). The gateway handles both formats transparently.
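For instance, this ASR-style input contains three turns (the same three lines without the speaker prefixes would also count as three turns):

```text
Alice : Let's review the budget for Q3.
Bob : Cloud spend is up 12% compared to Q2.
Alice : OK, let's flag that for the finance meeting.
```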
| Parameter | Description |
|---|---|
| `create_new_turn_after` | Token threshold for splitting long turns |
| `max_new_turns` | Turns per batch - controls output granularity |
| `summary_turns` | Previous summary turns kept in context |
| `reduce_summary` | Enable final consolidation step on iterative processing |
Controlling output detail: fewer turns per batch means more LLM passes and more detailed output that preserves individual speaker interventions.
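Putting the parameters together, the iterative part of a flavor's settings might look like this (parameter names from the table above; values and JSON shape are illustrative):

```json
{
  "create_new_turn_after": 250,
  "max_new_turns": 10,
  "summary_turns": 2,
  "reduce_summary": true
}
```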
LLM Gateway includes integrated tokenizer management supporting both tiktoken (OpenAI) and HuggingFace tokenizers. This ensures accurate token counting for any model, with per-flavor tokenizer override capability.
After processing, the gateway can automatically extract structured metadata from results using a configurable extraction prompt. Fields are defined via document templates attached to the service.
Document Categorization: Pass tags at execution time to classify documents:
```bash
curl -X POST ".../execute" -d '{
  "input": "Meeting transcript...",
  "context": {
    "tags": [
      {"name": "project-update", "description": "Project status discussions"},
      {"name": "budget", "description": "Financial topics"}
    ]
  }
}'
```

Response includes matched tags with confidence scores:
```json
{
  "categorization": {
    "matched_tags": [{"name": "project-update", "confidence": 0.95, "mentions": 3}],
    "unmatched_tags": ["budget"]
  }
}
```

Create a `.env` file:
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql+asyncpg://...` |
| `SERVICES_BROKER` | Redis broker URL | `redis://task-broker-redis:6379/0` |
| `ENCRYPTION_KEY` | Key for encrypting API keys | (required) |
| `CELERY_WORKERS` | Celery worker count | `1` |
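A minimal `.env` using these variables (values mirror the defaults above; generate your own `ENCRYPTION_KEY`):

```bash
DATABASE_URL=postgresql+asyncpg://user:password@postgres:5432/llm_gateway
SERVICES_BROKER=redis://task-broker-redis:6379/0
ENCRYPTION_KEY=your-fernet-key-here
CELERY_WORKERS=1
```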
The gateway automatically retries failed LLM API calls (rate limits, timeouts, 5xx errors) using exponential backoff.
| Variable | Description | Default |
|---|---|---|
| `API_MAX_RETRIES` | Maximum retry attempts | `6` |
| `API_RETRY_MIN_DELAY` | Minimum delay between retries (seconds) | `1` |
| `API_RETRY_MAX_DELAY` | Maximum delay between retries (seconds) | `60` |
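For example, to tolerate longer provider outages you might raise the retry ceiling (illustrative values):

```bash
API_MAX_RETRIES=8
API_RETRY_MIN_DELAY=2
API_RETRY_MAX_DELAY=120
```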
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/services` | GET | List all services |
| `/api/v1/services/{id}/execute` | POST | Execute a service |
| `/api/v1/jobs/{id}` | GET | Get job status |
| `/ws/jobs/{id}` | WS | Real-time job updates |
```bash
# Execute a service
curl -X POST "http://localhost:8000/api/v1/services/{service_id}/execute" \
  -H "Content-Type: application/json" \
  -d '{"input": "Text to process...", "flavor_name": "default"}'

# Monitor job via WebSocket
wscat -c "ws://localhost:8000/ws/jobs/{job_id}"
```

Two Docker Compose configurations are available:
| File | Use Case | Description |
|---|---|---|
| `docker-compose.yml` | Production (default) | Works out-of-the-box, uses published images |
| `docker-compose.dev.yml` | Development | Hot-reload, volume mounts, requires `.env` |
```bash
# Quick start (works immediately with defaults)
docker compose up -d

# Or with custom configuration
cp .env.example .env
vim .env            # Edit settings
docker compose up -d
```

Create a `.env` file with production values:
```bash
# Database
DATABASE_URL=postgresql+asyncpg://user:password@postgres:5432/llm_gateway

# Redis broker
SERVICES_BROKER=redis://redis:6379/0

# Security - generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
ENCRYPTION_KEY=your-fernet-key-here

# CORS origins (comma-separated)
CORS_ORIGINS=https://your-domain.com

# Frontend URLs
NEXT_PUBLIC_API_URL=https://api.your-domain.com
NEXT_PUBLIC_WS_URL=wss://api.your-domain.com
```

Adjust the `CELERY_WORKERS` environment variable to scale processing capacity:
```bash
# In docker-compose.yml or .env
CELERY_WORKERS=4   # Number of concurrent task workers
```

For high-throughput deployments, consider running dedicated Celery worker containers:
```bash
docker compose up -d --scale celery-worker=4
```

- How It Works - Architecture and processing modes
- API Integration - Complete integration guide
- Model Limits Guide - Token limits configuration
- Flavor Presets Guide - Pre-configured settings
- Document Templates Guide - DOCX export
- Security Guide - Credential rotation and security
Copyright (c) 2024-2025 LINAGORA - LinTO.ai