LLM Providers
This guide covers setup and configuration for each LLM provider supported by ContextRouter. Choose the provider(s) that best fit your use case.
Google Vertex AI
Best for: Production deployments, multimodal workloads, enterprise requirements
Google’s Vertex AI provides access to Gemini models with enterprise SLAs and built-in grounding capabilities.
Setup
- Create a GCP project with Vertex AI enabled
- Authenticate:
```bash
# Application Default Credentials (development)
gcloud auth application-default login

# Service Account (production)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
- Configure:
```toml
[vertex]
project_id = "your-gcp-project"
location = "us-central1"  # or europe-west1, asia-northeast1, etc.
```
Available Models
```python
# Fast, cost-effective
llm = model_registry.create_llm("vertex/gemini-2.0-flash", config=config)

# Ultra-lightweight
llm = model_registry.create_llm("vertex/gemini-2.0-flash-lite", config=config)

# Most capable
llm = model_registry.create_llm("vertex/gemini-2.5-pro", config=config)
```
Features
- ✅ Native multimodal (text, images, audio, video)
- ✅ Structured output with JSON mode (see the sketch after this list)
- ✅ Built-in grounding with Google Search
- ✅ Function calling / tool use
- ✅ Long context (up to 1M tokens on Pro)
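The structured-output feature pairs naturally with strict JSON parsing on the caller side. A minimal sketch follows; note that `generate` and its `response_format` argument are hypothetical placeholders for ContextRouter's actual LLM invocation interface, with `model_registry` and `config` assumed from the examples above:
```python
import json

llm = model_registry.create_llm("vertex/gemini-2.0-flash", config=config)

# Hypothetical call: the method name and JSON-mode flag below are
# placeholders, not ContextRouter's confirmed interface.
raw = llm.generate(
    "Return JSON with keys 'city' and 'country' for: Kyoto, Japan",
    response_format={"type": "json_object"},
)
data = json.loads(raw)
print(data["city"], data["country"])
```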
OpenAI
Best for: Ecosystem compatibility, proven quality, function calling
Setup
```bash
export OPENAI_API_KEY=sk-...
```
Or in settings:
```toml
[openai]
api_key = "${OPENAI_API_KEY}"
organization = "org-..."  # Optional
```
Available Models
```python
# GPT-4o (multimodal)
llm = model_registry.create_llm("openai/gpt-4o", config=config)

# GPT-4o Mini (cost-effective)
llm = model_registry.create_llm("openai/gpt-4o-mini", config=config)

# o1 (reasoning)
llm = model_registry.create_llm("openai/o1", config=config)
```
Features
- ✅ Vision (GPT-4o)
- ✅ Function calling
- ✅ JSON mode
- ✅ Whisper for audio transcription
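Because the provider is keyed off `OPENAI_API_KEY`, it is worth failing fast when the variable is missing rather than surfacing an auth error on the first request. A minimal sketch, assuming `model_registry` and `config` from the examples above:
```python
import os

# Fail fast: the provider is assumed to read OPENAI_API_KEY (or the
# [openai] settings section) when the model is created.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before creating an OpenAI model")

llm = model_registry.create_llm("openai/gpt-4o-mini", config=config)
```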
Anthropic
Best for: Long documents, safety-focused applications, nuanced reasoning
Setup
```bash
export ANTHROPIC_API_KEY=sk-ant-...
```
Available Models
```python
# Claude 4 Sonnet (balanced)
llm = model_registry.create_llm("anthropic/claude-sonnet-4", config=config)

# Claude 3.5 Sonnet
llm = model_registry.create_llm("anthropic/claude-3.5-sonnet", config=config)

# Claude 3 Opus (most capable)
llm = model_registry.create_llm("anthropic/claude-3-opus", config=config)
```
Features
- ✅ 200K token context window (see the sketch after this list)
- ✅ Vision support
- ✅ Tool use
- ✅ Constitutional AI safety
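To exploit the 200K-token window without overshooting it, estimate a document's size before sending it. The 4-characters-per-token rule below is a rough English-text heuristic, not Anthropic's tokenizer:
```python
def fits_in_claude_context(text: str, max_tokens: int = 200_000) -> bool:
    # ~4 characters/token is a rough heuristic for English text;
    # leave headroom for the prompt and the model's response.
    return len(text) // 4 < max_tokens

document = open("contract.txt").read()  # hypothetical long document
if fits_in_claude_context(document):
    llm = model_registry.create_llm("anthropic/claude-sonnet-4", config=config)
```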
Groq
Best for: Ultra-fast inference, real-time applications
Groq provides extremely fast inference for open-source models.
Setup
```bash
export GROQ_API_KEY=gsk_...
```
Available Models
```python
# Llama 3.3 70B
llm = model_registry.create_llm("groq/llama-3.3-70b-versatile", config=config)

# Mixtral 8x7B
llm = model_registry.create_llm("groq/mixtral-8x7b-32768", config=config)

# Whisper (ASR)
llm = model_registry.create_llm("groq/whisper-large-v3", config=config)
```
Features
- ✅ Sub-second latency for most queries (see the timing sketch after this list)
- ✅ Open-source models
- ✅ Whisper ASR integration
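To verify the latency claim in your own environment, wrap a request in a timer. As elsewhere in this guide, `generate` is a hypothetical placeholder for the invocation method the returned LLM object actually exposes:
```python
import time

llm = model_registry.create_llm("groq/llama-3.3-70b-versatile", config=config)

start = time.perf_counter()
response = llm.generate("Summarize RAG in one sentence.")  # hypothetical method
print(f"Groq round trip: {time.perf_counter() - start:.2f}s")
```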
OpenRouter
Best for: Access to hundreds of models through one API
OpenRouter aggregates models from many providers.
Setup
```bash
export OPENROUTER_API_KEY=sk-or-...
```
Available Models
```python
# DeepSeek R1 (reasoning)
llm = model_registry.create_llm("openrouter/deepseek/deepseek-r1", config=config)

# Qwen 2.5
llm = model_registry.create_llm("openrouter/qwen/qwen-2.5-72b", config=config)

# Many more at openrouter.ai/models
```
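The examples above follow the pattern `openrouter/<vendor>/<model>`, mirroring the IDs listed at openrouter.ai/models, so any listed model can be plugged in directly:
```python
# Build the registry key from an openrouter.ai/models ID
vendor, model = "deepseek", "deepseek-r1"
llm = model_registry.create_llm(f"openrouter/{vendor}/{model}", config=config)
```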
Local Models (Ollama)
Best for: Privacy, offline use, development, cost savings
Setup
- Install and start Ollama:
```bash
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Start server
ollama serve

# Pull a model
ollama pull llama3.2
```
- Configure:
```bash
export LOCAL_OLLAMA_BASE_URL=http://localhost:11434/v1
```
Available Models
```python
# Llama 3.2 (latest)
llm = model_registry.create_llm("local/llama3.2", config=config)

# Mistral
llm = model_registry.create_llm("local/mistral", config=config)

# Code Llama
llm = model_registry.create_llm("local/codellama", config=config)
```
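Since `local/*` keys depend on a running Ollama server, a quick reachability probe saves confusing failures later. The sketch below hits the `/models` path of Ollama's OpenAI-compatible API; treat the exact endpoint as an assumption:
```python
import os
import urllib.request

base_url = os.environ.get("LOCAL_OLLAMA_BASE_URL", "http://localhost:11434/v1")
try:
    # Probe the OpenAI-compatible model-listing endpoint (path is an assumption)
    with urllib.request.urlopen(f"{base_url}/models", timeout=2) as resp:
        print("Ollama reachable, HTTP", resp.status)
except OSError:
    raise SystemExit("Ollama not reachable; run `ollama serve` first")
```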
Local Models (vLLM)
Best for: High-throughput production serving of open models
Setup
- Start vLLM server:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```
- Configure:
```bash
export LOCAL_VLLM_BASE_URL=http://localhost:8000/v1
```
Usage
```python
llm = model_registry.create_llm(
    "local-vllm/meta-llama/Llama-3.1-8B-Instruct",
    config=config,
)
```
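A self-hosted server fails more often than a managed API, so pairing the vLLM model with a hosted fallback via `get_llm_with_fallback` (covered under Best Practices below) is a reasonable default:
```python
# Prefer the local vLLM deployment; fail over to a hosted model
# if the server is unreachable.
model = model_registry.get_llm_with_fallback(
    key="local-vllm/meta-llama/Llama-3.1-8B-Instruct",
    fallback_keys=["vertex/gemini-2.0-flash"],
    strategy="fallback",
)
```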
HuggingFace Transformers
Best for: Running models directly in-process, specialized tasks (STT, classification)
Setup
```bash
pip install contextrouter[hf-transformers]
```
Usage
```python
# Small model for testing
llm = model_registry.create_llm("hf/distilgpt2", config=config)

# TinyLlama for chat
llm = model_registry.create_llm("hf/TinyLlama/TinyLlama-1.1B-Chat-v1.0", config=config)

# Whisper for ASR
asr = model_registry.create_llm(
    "hf/openai/whisper-tiny",
    config=config,
    task="automatic-speech-recognition",
)
```
Note: HuggingFace models run locally and require sufficient RAM/GPU. Use them for specialized tasks or small models.
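Because these models load into your process, it helps to check the hardware before creating one. The snippet assumes a PyTorch backend and only inspects availability; whether `create_llm` accepts a device argument is not shown here:
```python
import torch

# In-process models compete with your app for memory; a GPU helps a lot.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Transformers models will run on: {device}")
```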
Best Practices
For Production RAG
Use reliable models with good instruction following:
- `vertex/gemini-2.0-flash` — Best balance of speed/quality
- `openai/gpt-4o-mini` — Reliable, good value
For Structured Output (JSON)
Some models are better at following JSON schema requirements:
- ✅ `vertex/gemini-2.0-flash`
- ✅ `openai/gpt-4o-mini`
- ⚠️ Local models may struggle with complex JSON (see the validation sketch below)
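When a weaker model is unavoidable, validate its output before trusting it downstream. A minimal sketch: parse strictly first, then fall back to extracting the outermost `{...}` span, since such models often wrap JSON in prose:
```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse model output, tolerating prose wrapped around the JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: extract the outermost {...} span and retry
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            return json.loads(raw[start:end + 1])
        raise
```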
For Cost Optimization
```python
# Try cheap/free first, fall back to premium
model = model_registry.get_llm_with_fallback(
    key="local/llama3.2",
    fallback_keys=["groq/llama-3.3-70b", "vertex/gemini-2.0-flash"],
    strategy="cost-priority",
)
```
For Maximum Reliability
```python
# Multiple fallbacks across providers
model = model_registry.get_llm_with_fallback(
    key="vertex/gemini-2.0-flash",
    fallback_keys=["openai/gpt-4o", "anthropic/claude-sonnet-4"],
    strategy="fallback",
)
```