VRAM Requirements for Self-Hosted LLM Production Deployments
Running production LLM inference requires careful VRAM planning. Total VRAM requirements consist of model weights, KV cache that grows with concurrent users and context length, and system overhead.
Modern inference engines like vLLM with PagedAttention can achieve 24x higher throughput than basic implementations by reducing memory waste from 60-80% to under 4%.
VRAM Estimation Formula
The formula for estimating total VRAM requirements in production:
Total VRAM ≈ Model Weights + System Overhead + (KV Cache per Request × Concurrent Requests)
Components Explained
| Component | Description |
|---|---|
| Model Weights | Fixed VRAM for the quantized model parameters (depends on model size and quantization level) |
| System Overhead | Inference framework, CUDA, and OS overhead (~0.5-1 GB) |
| KV Cache per Request | Memory per active user session (grows with context length) |
| Concurrent Requests | Number of simultaneous inference sessions |
For Llama 3 8B at full 8K context, the KV cache requires approximately 1.1 GB per sequence. This grows linearly with context length and batch size.
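As a sanity check, the formula can be applied directly. The minimal sketch below reuses figures from this page: Llama 3 8B Q4 weights (~5.1 GB, see the tables below), ~0.5 GB overhead, and the ~1.1 GB of KV cache per full 8K-context session noted above.
```python
# Minimal sketch of the estimation formula above.
def total_vram_gb(model_weights_gb: float,
                  overhead_gb: float,
                  kv_per_request_gb: float,
                  concurrent_requests: int) -> float:
    """Total VRAM = weights + overhead + (KV cache per request x concurrent requests)."""
    return model_weights_gb + overhead_gb + kv_per_request_gb * concurrent_requests

# Llama 3 8B (Q4), 10 users each holding a full 8K context:
print(total_vram_gb(5.1, 0.5, 1.1, 10))  # ~16.6 GB
```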
Model VRAM Requirements
Popular Models (4-bit Quantization)
Small Models (7-9B)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Mistral 7B | 7B | 4.5 GB | ~0.10 GB | ~5 GB |
| Llama 3 8B | 8B | 5.1 GB | ~0.11 GB | ~6 GB |
| Gemma 2 9B | 9B | 5.8 GB | ~0.12 GB | ~7 GB |
Medium Models (13-34B)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Llama 2 13B | 13B | 8.0 GB | ~0.15 GB | ~9 GB |
| Yi 34B | 34B | 20 GB | ~0.40 GB | ~21 GB |
Large Models (70B+)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Mixtral 8x7B | 47B total | 28 GB | ~0.60 GB | ~29 GB |
| Llama 3 70B | 70B | 40 GB | ~0.80 GB | ~41 GB |
| Qwen 2.5 72B | 72B | 42 GB | ~0.85 GB | ~43 GB |
Each 1,000 tokens consumes approximately 0.11 GB of additional VRAM for 7-8B models. Long conversations can quickly exceed the memory used by the model weights themselves.
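Where do the per-1K-token figures come from? A common back-of-the-envelope estimate is 2 (K and V) × layers × KV heads × head dimension × bytes per element. The sketch below plugs in Llama 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an FP16 cache; the result lands close to the ~0.11 GB per 1K tokens and ~1.1 GB per 8K sequence quoted here. Other models' values come from their own configs.
```python
# Back-of-the-envelope KV-cache size for Llama 3 8B with an FP16 cache.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # FP16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")             # ~0.13 MB
print(f"{kv_bytes_per_token * 1000 / 1e9:.2f} GB per 1K tokens")  # ~0.13 GB
print(f"{kv_bytes_per_token * 8192 / 1e9:.2f} GB at 8K context")  # ~1.07 GB
```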
Critical Production Considerations
1. KV Cache Reality Check
For an A10 GPU with 24GB VRAM serving Llama 2 7B:
- Model weights: 14 GB (2 bytes per parameter × 7B)
- Available for KV cache: 10 GB
- Total capacity: ~20K tokens (including prompts)
This severely limits concurrent users with long contexts.
```python
# Example calculation for an A10 GPU serving Llama 2 7B
model_vram = 14       # GB for Llama 2 7B in FP16
gpu_vram = 24         # GB total
overhead = 0.5        # GB
available_kv = gpu_vram - model_vram - overhead   # 9.5 GB

tokens_per_gb = 2048  # approximate for 7B models in FP16
max_tokens = available_kv * tokens_per_gb         # ~19,456 tokens

# With 8K context per user:
max_concurrent_users = max_tokens / 8192          # ~2.4 users
```
Most concurrent requests don't use maximum context length. With an average context window of 4K tokens, you can support significantly more users than worst-case calculations suggest.
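To make that concrete, here is a small sketch reusing the A10 / Llama 2 7B numbers from the calculation above, comparing worst-case 8K contexts against a 4K average:
```python
# Concurrency under worst-case vs. average context, A10 + Llama 2 7B (FP16).
available_kv_gb = 9.5     # 24 GB - 14 GB weights - 0.5 GB overhead
tokens_per_gb = 2048      # approximate for 7B FP16 models (see above)
max_tokens = available_kv_gb * tokens_per_gb

print(f"worst case, 8K tokens/user: {max_tokens / 8192:.1f} concurrent users")
print(f"average,    4K tokens/user: {max_tokens / 4096:.1f} concurrent users")
```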
2. Modern Optimization: PagedAttention
vLLM with PagedAttention revolutionizes memory efficiency:
- Traditional systems: 60-80% memory waste
- vLLM with PagedAttention: Less than 4% memory waste
- Performance gain: Up to 24x higher throughput
Key Features:
- Paged Memory Management: Breaks KV cache into fixed-size blocks that can be stored non-contiguously
- Prefix Caching: When multiple requests share the same prompt (e.g., system prompts), stores only one copy
- Copy-on-Write: Creates new blocks only when sequences diverge
- Continuous Batching: Dynamically adds/removes requests at iteration level
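The memory-waste numbers above are easiest to see with a toy allocator comparison. The sketch below is illustrative only (a hypothetical batch of request lengths, vLLM's default 16-token block size): a naive allocator reserves the full context window for every request, while a paged allocator only holds the blocks a sequence has actually touched.
```python
import math

MAX_MODEL_LEN = 8192   # tokens a naive allocator reserves per request
BLOCK_SIZE = 16        # tokens per KV block (vLLM's default block size)

# Hypothetical actual lengths of six in-flight sequences
actual_lengths = [350, 1200, 4096, 800, 95, 2600]
used = sum(actual_lengths)

# Naive contiguous pre-allocation: the full window is reserved up front
naive_reserved = MAX_MODEL_LEN * len(actual_lengths)

# Paged allocation: only whole blocks that are actually used are resident
paged_reserved = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in actual_lengths)

print(f"naive waste: {1 - used / naive_reserved:.0%}")   # ~81%
print(f"paged waste: {1 - used / paged_reserved:.1%}")   # ~0.1%
```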
Realistic Deployment Examples
Example 1: Small Production (10-20 users)
Scenario: Customer support chatbot with moderate traffic
Specifications:
- Model: Llama 3 8B (Q4_K_M quantization)
- Average context: 4K tokens per user
- Peak concurrent users: 20
Calculation:
Base model: 5.1 GB
System overhead: 0.5 GB
Per-user KV (4K): 0.44 GB
20 users: 5.1 + 0.5 + (0.44 × 20) = 14.4 GB
Recommended GPU: RTX 4060 Ti 16GB (dev) or L4 24GB (production)
Cost Analysis:
| Option | Setup | Monthly Cost | Notes |
|---|---|---|---|
| Self-hosted L4 | Cloud | ~$450/mo | 24/7 operation |
| Self-hosted RTX 4060 Ti | On-prem | $499 one-time | Dev/testing only |
| Cloud (spot) | Cloud | ~$200-300/mo | Interruptible |
Example 2: Medium Production (50-100 users)
Scenario: Internal AI assistant for enterprise
Specifications:
- Model: Mistral 7B (Q4_K_M) with vLLM
- Average context: 3K tokens
- Peak concurrent users: 100
- Optimizations: PagedAttention + prefix caching
Calculation:
Base model: 4.5 GB
System overhead: 0.5 GB
Per-user KV (avg 3K): 0.30 GB
Theoretical (100 users): 4.5 + 0.5 + (0.30 × 100) = 35 GB
With PagedAttention efficiency and prefix caching:
Actual requirement: ~20-25 GB
Architecture:
Recommended GPU: A10G 24GB or RTX 4090 24GB (development)
Example 3: Enterprise Scale (200+ users)
Scenario: High-traffic SaaS platform
Specifications:
- Model: Llama 3 70B (AWQ 4-bit)
- Average context: 6K tokens
- Peak concurrent users: 200-500
- Setup: Multi-GPU with tensor parallelism
Calculation:
Model weights: 40 GB (AWQ 4-bit)
System overhead: 1 GB
Tensor parallelism: Distributed across 2 GPUs
Per-GPU utilization: ~35 GB (model) + KV cache
Total system: 2× A100 80GB or 2× H100 80GB
Deployment: vLLM with continuous batching and tensor parallelism
Performance:
Expected Throughput:
- TTFT (Time to First Token): 200-400 ms
- Tokens/second: 130-250 (depending on batch size)
- Concurrent requests: 200-500 with dynamic batching
Infrastructure:
- 2× A100 80GB: ~$3,000-6,000/month (cloud)
- 2× H100 80GB: ~$12,000-20,000/month (cloud)
- On-premises: ~$500-1,000/month power
Production GPU Recommendations
Datacenter GPUs: Cloud vs. Market Pricing
| Use Case | GPU | Cloud Price | Market Price | Rationale |
|---|---|---|---|---|
| Development/Testing | RTX 4090 (24GB) | N/A | $1,600-2,000 | Exceptional for developers and small teams, handles 7B-13B models comfortably for prototyping |
| Development/Testing | RTX 5090 (32GB) | N/A | $2,000-2,500 | Latest with 32GB VRAM, 30% faster than 4090 for transformers |
| Small Production | L4 (24GB) | $0.60-1.00/hr | $4,000-6,000 | Power efficient (72W), compact, handles 7B-13B models for chatbots |
| Small Production | RTX A4000 (16GB) | $0.40-0.70/hr | $1,500-2,000 | Enterprise-rated workstation GPU with ECC memory |
| Small Production | A10G (24GB) | $1.00-1.50/hr | $5,000-7,000 | Cost-efficient for deployments, widely available in cloud |
| Standard Production | A100 40GB | $1.50-2.50/hr | $10,000-12,000 | Proven, efficient, still powers production at scale |
| Standard Production | A100 80GB | $2.50-4.00/hr | $15,000-18,000 | Double memory for larger models or more concurrent users |
| High-Performance LLM | H100 80GB (PCIe) | $4.00-6.00/hr | $25,000-30,000 | 2-3x faster than A100 for LLM workloads with FP8 support |
| High-Performance LLM | H100 80GB (SXM) | $6.00-10.00/hr | $30,000-35,000 | NVLink for multi-GPU, best for transformer-heavy models |
| Multi-Model/Long-Context | H200 (141GB) | $8.00-12.00/hr | $40,000-55,000 | 76% more memory and 43% higher bandwidth than H100 |
| Frontier-Scale | B200 (192GB) | $12.00-18.00/hr | $30,000-35,000 | Availability severely constrained with 3-6 month lead times |
Cloud prices vary by provider (AWS, GCP, Azure, RunPod, Lambda Labs). Market prices are approximate as of November 2025 and fluctuate based on supply/demand.
Monthly Cost Examples (24/7 Operation)
Small Scale:
Target: 10-20 concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 1× L4 (cloud) | $450-750 | $5,400-9,000 |
| 1× RTX 4090 (on-prem) | $50-100* | $2,100** |
*Power costs only
**Includes hardware cost
Medium Scale:
Target: 50-100 concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 1× A10G (cloud) | $750-1,100 | $9,000-13,200 |
| 1× A100 40GB (cloud) | $1,100-1,800 | $13,200-21,600 |
| 1× A100 80GB (on-prem) | $400-600* | $19,800** |
*Power costs only
**Includes hardware cost ($15,000)
Enterprise Scale:
Target: 200+ concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 2× A100 80GB (cloud) | $3,600-5,800 | $43,200-69,600 |
| 2× H100 80GB (cloud) | $8,600-14,400 | $103,200-172,800 |
| 2× H100 80GB (on-prem) | $1,000-1,500* | $78,000** |
*Power costs only
**Includes hardware cost ($60,000)
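The TCO figures in these tables follow from simple arithmetic: cloud TCO is the monthly bill times twelve, and on-prem TCO is hardware cost plus a year of power. A quick sketch reproducing the 2× H100 rows:
```python
def cloud_tco_1yr(monthly_cost: float) -> float:
    return monthly_cost * 12

def onprem_tco_1yr(hardware_cost: float, monthly_power: float) -> float:
    return hardware_cost + monthly_power * 12

# 2x H100 80GB rows from the enterprise table above
print(cloud_tco_1yr(8_600), cloud_tco_1yr(14_400))  # 103,200 - 172,800
print(onprem_tco_1yr(60_000, 1_500))                # 78,000
```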
⚠️ Consumer GPU Warning for Production
Consumer GPUs (RTX 4090/5090) are NOT recommended for production deployments despite attractive pricing. Here's why:
| Feature | Consumer GPUs | Datacenter GPUs |
|---|---|---|
| ECC Memory | ❌ No (data corruption risk) | ✅ Yes (data integrity) |
| 24/7 Operation | ❌ Not rated | ✅ Designed for continuous use |
| Warranty | 1-3 years, consumer support | 3-5 years, enterprise SLAs |
| Multi-GPU Scaling | Limited/No NVLink | ✅ NVLink, NVSwitch |
| Power Management | Gaming-focused | Datacenter-optimized |
| MIG Support | ❌ No | ✅ Yes (resource partitioning) |
| Thermal Design | Intermittent loads | Continuous operation |
Acceptable Consumer GPU Use Cases:
- Development and testing environments
- Proof-of-concept deployments
- Personal projects and learning
- Budget-constrained startups (with awareness of risks)
Cost Optimization Strategies
1. Cloud vs. On-Premises Decision Matrix
2. Hybrid Deployment Strategy
Tier your workloads:
```python
# Example routing logic; analyze_complexity is assumed to return a 0-1 score
def route_request(request):
    complexity_score = analyze_complexity(request)
    if complexity_score < 0.3:
        # ~70% of queries - use small model
        return "llama-3-8b-endpoint"    # A10G GPU
    elif complexity_score < 0.7:
        # ~20% of queries - use medium model
        return "mixtral-8x7b-endpoint"  # A100 80GB
    else:
        # ~10% of queries - use large model
        return "llama-3-70b-endpoint"   # 2x H100 80GB
```
Cost Impact:
- Small model tier (≈70% of queries): $0.70/hr
- Medium model tier (≈20% of queries): $0.60/hr
- Large model tier (≈10% of queries): $1.20/hr
- Blended total: $2.50/hr vs. $12.00/hr (all queries on the large model)
- Savings: ~79%
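The savings figure follows directly from the tier costs above: roughly $2.50/hr blended versus $12.00/hr if every query hit the large-model endpoint. A quick check:
```python
# Blended hourly cost of the tiered setup vs. routing everything to the large model
tier_cost_per_hr = {
    "small (70% of queries)": 0.70,
    "medium (20% of queries)": 0.60,
    "large (10% of queries)": 1.20,
}

blended = sum(tier_cost_per_hr.values())  # ~$2.50/hr
all_large = 12.00                         # $/hr if every query used the 2x H100 endpoint
savings = 1 - blended / all_large

print(f"${blended:.2f}/hr vs ${all_large:.2f}/hr -> {savings:.0%} savings")  # ~79%
```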
3. Reserved Instances & Commitments
| Provider | Spot/Preemptible | 1-Year Reserved | 3-Year Reserved |
|---|---|---|---|
| AWS | 50-70% off | 30-40% off | 50-60% off |
| GCP | 60-91% off | 37% off | 55% off |
| Azure | 60-80% off | 30-50% off | 50-70% off |
Start with on-demand or spot instances for testing. Commit to reserved instances once usage patterns are established (typically after 2-3 months).
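As a rough planning aid, the discount tiers translate into monthly costs as sketched below. The $2.00/hr on-demand rate is an illustrative A100-class figure, and the discounts are approximate midpoints of the ranges in the table above.
```python
# Effective monthly cost under different commitment levels (illustrative numbers).
on_demand_hourly = 2.00   # $/hr, hypothetical A100-class on-demand rate
hours_per_month = 730
discounts = {"on-demand": 0.0, "spot": 0.65, "1-year reserved": 0.35, "3-year reserved": 0.55}

for plan, discount in discounts.items():
    monthly = on_demand_hourly * (1 - discount) * hours_per_month
    print(f"{plan:>16}: ${monthly:,.0f}/mo")
```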
4. Used Datacenter GPU Market
| GPU | New Price | Used Price | Considerations |
|---|---|---|---|
| Tesla V100 32GB | $8,000 | $2,000-3,500 | Older but reliable, good for smaller models |
| A40 48GB | $10,000 | $5,000-7,000 | Excellent for inference, good VRAM |
| A100 40GB | $12,000 | $7,000-9,000 | Still very capable for most workloads |
| A100 80GB | $18,000 | $12,000-15,000 | Best value in used market |
When buying used hardware:
- Verify warranty transfer and remaining coverage
- Test thoroughly before deployment
- Check for mining history (can degrade GPUs)
- Ensure compatible firmware/drivers
Infrastructure Best Practices
1. Inference Framework Selection
vLLM
Best for: Production throughput, concurrent users
```bash
# Installation
pip install vllm

# Launch server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --dtype float16
```
Pros:
- ✅ PagedAttention for memory efficiency
- ✅ Continuous batching
- ✅ Highest throughput
- ✅ OpenAI-compatible API
Cons:
- ❌ More complex setup
- ❌ GPU-only (CUDA required)
Text Generation Inference
Best for: Production with HuggingFace models
```bash
# Docker deployment
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-8B-Instruct
```
Pros:
- ✅ Easy HuggingFace integration
- ✅ Good performance
- ✅ Production-ready
Cons:
- ❌ Slightly lower throughput than vLLM
- ❌ Less flexible memory management
llama.cpp
Best for: CPU/Apple Silicon, edge deployment
```bash
# Installation
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run server
./server -m models/llama-3-8b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080
```
Pros:
- ✅ CPU support (no GPU required)
- ✅ Excellent Apple Silicon support
- ✅ Wide quantization format support (GGUF)
- ✅ Low memory overhead
Cons:
- ❌ Lower throughput than GPU solutions
- ❌ Not optimized for high concurrency
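Because vLLM exposes an OpenAI-compatible API, clients can use the standard openai Python package. A minimal sketch, assuming the vLLM server command above is running locally on its default port 8000 (host, port, and API key handling are deployment-specific):
```python
# Querying a local vLLM OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default port
    api_key="EMPTY",                      # any value works unless the server enforces a key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```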
2. Monitoring and Observability
Key metrics to track:
```python
# Example monitoring setup (Prometheus + Grafana)
metrics = {
    # GPU Metrics
    "gpu_memory_used": "GPU memory utilization (%)",
    "gpu_utilization": "GPU compute utilization (%)",
    "gpu_temperature": "GPU temperature (°C)",
    # Inference Metrics
    "ttft": "Time to first token (ms)",
    "tokens_per_second": "Generation speed",
    "requests_per_second": "Throughput",
    "queue_length": "Pending requests",
    # KV Cache Metrics
    "kv_cache_usage": "KV cache memory (GB)",
    "kv_cache_hit_rate": "Prefix cache hit rate (%)",
    "active_sequences": "Concurrent active sequences",
    # Business Metrics
    "p50_latency": "Median latency (ms)",
    "p99_latency": "99th percentile latency (ms)",
    "error_rate": "Failed requests (%)",
}
```
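A hedged sketch of how a few of the GPU metrics above could be exported for Prometheus to scrape, using prometheus_client and NVML via pynvml. The metric names mirror the table; the 15-second interval and port 9400 are arbitrary choices.
```python
# Minimal GPU-metrics exporter sketch (prometheus_client + pynvml).
import time

import pynvml
from prometheus_client import Gauge, start_http_server

gpu_memory_used = Gauge("gpu_memory_used", "GPU memory utilization (%)")
gpu_utilization = Gauge("gpu_utilization", "GPU compute utilization (%)")
gpu_temperature = Gauge("gpu_temperature", "GPU temperature (C)")

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

start_http_server(9400)  # Prometheus scrapes http://<host>:9400/metrics
while True:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

    gpu_memory_used.set(100 * mem.used / mem.total)
    gpu_utilization.set(util.gpu)
    gpu_temperature.set(temp)
    time.sleep(15)
```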
3. Scaling Strategies
Key Takeaways
- Context length dominates memory: Long conversations can consume more VRAM than the model weights themselves
- Use modern inference engines: vLLM with PagedAttention achieves up to 24x higher throughput by reducing memory waste
- Actual concurrency differs from theoretical: Prefix caching and average context lengths allow more users than worst-case math suggests
- Production requires datacenter GPUs: Consumer cards lack reliability, ECC memory, and support needed for 24/7 operation
- Test before buying: Real-world capacity typically exceeds theoretical calculations
Decision Framework
Resource Checklist
Before deploying to production:
- Load test with realistic traffic patterns (not just synthetic benchmarks)
- Measure actual KV cache usage with your specific use case
- Test failover and redundancy (what happens when a GPU fails?)
- Monitor costs daily for the first month
- Set up alerting for GPU memory, temperature, and latency
- Document your deployment (infrastructure as code)
- Plan for model updates (how will you deploy new versions?)
- Implement rate limiting to prevent abuse
- Set up logging for debugging and compliance
- Test disaster recovery procedures
Additional Resources
- vLLM Documentation: https://docs.vllm.ai/
- NVIDIA GPU Selector: https://www.nvidia.com/en-us/data-center/gpu-selector/
- LLM VRAM Calculator: https://apxml.com/tools/vram-calculator
- Cloud GPU Pricing Comparison: https://cloud-gpus.com/
Always validate with real workloads before finalizing hardware decisions. Cloud deployments allow testing without capital expense, making them ideal for determining actual requirements before committing to on-premises infrastructure.
- Start with a small cloud deployment (1ร L4 or A10G)
- Run your actual workload for 1-2 weeks
- Analyze metrics (GPU utilization, latency, throughput)
- Scale up or down based on real data
- Consider on-premises only after 3+ months of stable usage patterns