VRAM Requirements for Self-Hosted LLM Production Deployments
Running production LLM inference requires careful VRAM planning. Total VRAM requirements consist of model weights, KV cache that grows with concurrent users and context length, and system overhead.
Modern inference engines like vLLM with PagedAttention can achieve 24x higher throughput than basic implementations by reducing memory waste from 60-80% to under 4%.
VRAM Estimation Formula
The formula for estimating total VRAM requirements in production:
Total VRAM ≈ Model Weights + System Overhead + (KV Cache per Request × Concurrent Requests)
Components Explained
| Component | Description |
|---|---|
| Model Weights | Fixed VRAM for the quantized model parameters (depends on model size and quantization level) |
| System Overhead | Inference framework, CUDA, and OS overhead (~0.5-1 GB) |
| KV Cache per Request | Memory per active user session (grows with context length) |
| Concurrent Requests | Number of simultaneous inference sessions |
For Llama 3 8B at full 8K context, the KV cache requires approximately 1.1 GB per sequence. This grows linearly with context length and batch size.
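As a sanity check, the formula can be applied directly. The minimal sketch below reuses figures from this page: Llama 3 8B Q4 weights (~5.1 GB, see the tables below), ~0.5 GB overhead, and the ~1.1 GB of KV cache per full 8K-context session noted above.
```python
# Minimal sketch of the estimation formula above.
def total_vram_gb(model_weights_gb: float,
                  overhead_gb: float,
                  kv_per_request_gb: float,
                  concurrent_requests: int) -> float:
    """Total VRAM = weights + overhead + (KV cache per request x concurrent requests)."""
    return model_weights_gb + overhead_gb + kv_per_request_gb * concurrent_requests

# Llama 3 8B (Q4), 10 users each holding a full 8K context:
print(total_vram_gb(5.1, 0.5, 1.1, 10))  # ~16.6 GB
```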
Model VRAM Requirements
Popular Models (4-bit Quantization)
Small Models (7-9B)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Mistral 7B | 7B | 4.5 GB | ~0.10 GB | ~5 GB |
| Llama 3 8B | 8B | 5.1 GB | ~0.11 GB | ~6 GB |
| Gemma 2 9B | 9B | 5.8 GB | ~0.12 GB | ~7 GB |
Medium Models (13-34B)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Llama 2 13B | 13B | 8.0 GB | ~0.15 GB | ~9 GB |
| Yi 34B | 34B | 20 GB | ~0.40 GB | ~21 GB |
Large Models (70B+)
| Model | Parameters | Weights (Q4) | KV Cache per 1K tokens | Base VRAM |
|---|---|---|---|---|
| Mixtral 8x7B | 47B total | 28 GB | ~0.60 GB | ~29 GB |
| Llama 3 70B | 70B | 40 GB | ~0.80 GB | ~41 GB |
| Qwen 2.5 72B | 72B | 42 GB | ~0.85 GB | ~43 GB |
Each 1,000 tokens consumes approximately 0.11 GB of additional VRAM for 7-8B models. Long conversations can quickly exceed the memory used by the model weights themselves.
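Where do the per-1K-token figures come from? A common back-of-the-envelope estimate is 2 (K and V) × layers × KV heads × head dimension × bytes per element. The sketch below plugs in Llama 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an FP16 cache; the result lands close to the ~0.11 GB per 1K tokens and ~1.1 GB per 8K sequence quoted here. Other models' values come from their own configs.
```python
# Back-of-the-envelope KV-cache size for Llama 3 8B with an FP16 cache.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # FP16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")             # ~0.13 MB
print(f"{kv_bytes_per_token * 1000 / 1e9:.2f} GB per 1K tokens")  # ~0.13 GB
print(f"{kv_bytes_per_token * 8192 / 1e9:.2f} GB at 8K context")  # ~1.07 GB
```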
Critical Production Considerations
1. KV Cache Reality Check
For an A10 GPU with 24GB VRAM serving Llama 2 7B:
- Model weights: 14 GB (2 bytes per parameter × 7B)
- Available for KV cache: 10 GB
- Total capacity: ~20K tokens (including prompts)
This severely limits concurrent users with long contexts.
```python
# Example calculation for an A10 GPU serving Llama 2 7B
model_vram = 14       # GB for Llama 2 7B in FP16
gpu_vram = 24         # GB total
overhead = 0.5        # GB
available_kv = gpu_vram - model_vram - overhead   # 9.5 GB

tokens_per_gb = 2048  # approximate for 7B models in FP16
max_tokens = available_kv * tokens_per_gb         # ~19,456 tokens

# With 8K context per user:
max_concurrent_users = max_tokens / 8192          # ~2.4 users
```
Most concurrent requests don't use maximum context length. With an average context window of 4K tokens, you can support significantly more users than worst-case calculations suggest.
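To make that concrete, here is a small sketch reusing the A10 / Llama 2 7B numbers from the calculation above, comparing worst-case 8K contexts against a 4K average:
```python
# Concurrency under worst-case vs. average context, A10 + Llama 2 7B (FP16).
available_kv_gb = 9.5     # 24 GB - 14 GB weights - 0.5 GB overhead
tokens_per_gb = 2048      # approximate for 7B FP16 models (see above)
max_tokens = available_kv_gb * tokens_per_gb

print(f"worst case, 8K tokens/user: {max_tokens / 8192:.1f} concurrent users")
print(f"average,    4K tokens/user: {max_tokens / 4096:.1f} concurrent users")
```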
2. Modern Optimization: PagedAttention
vLLM with PagedAttention revolutionizes memory efficiency:
- Traditional systems: 60-80% memory waste
- vLLM with PagedAttention: Less than 4% memory waste
- Performance gain: Up to 24x higher throughput
Key Features:
- Paged Memory Management: Breaks KV cache into fixed-size blocks that can be stored non-contiguously
- Prefix Caching: When multiple requests share the same prompt (e.g., system prompts), stores only one copy
- Copy-on-Write: Creates new blocks only when sequences diverge
- Continuous Batching: Dynamically adds/removes requests at iteration level
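The memory-waste numbers above are easiest to see with a toy allocator comparison. The sketch below is illustrative only (a hypothetical batch of request lengths, vLLM's default 16-token block size): a naive allocator reserves the full context window for every request, while a paged allocator only holds the blocks a sequence has actually touched.
```python
import math

MAX_MODEL_LEN = 8192   # tokens a naive allocator reserves per request
BLOCK_SIZE = 16        # tokens per KV block (vLLM's default block size)

# Hypothetical actual lengths of six in-flight sequences
actual_lengths = [350, 1200, 4096, 800, 95, 2600]
used = sum(actual_lengths)

# Naive contiguous pre-allocation: the full window is reserved up front
naive_reserved = MAX_MODEL_LEN * len(actual_lengths)

# Paged allocation: only whole blocks that are actually used are resident
paged_reserved = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in actual_lengths)

print(f"naive waste: {1 - used / naive_reserved:.0%}")   # ~81%
print(f"paged waste: {1 - used / paged_reserved:.1%}")   # ~0.1%
```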
Realistic Deployment Examples
Example 1: Small Production (10-20 users)
Scenario: Customer support chatbot with moderate traffic
Specifications:
- Model: Llama 3 8B (Q4_K_M quantization)
- Average context: 4K tokens per user
- Peak concurrent users: 20
Calculation:
Base model: 5.1 GB
System overhead: 0.5 GB
Per-user KV (4K): 0.44 GB
20 users: 5.1 + 0.5 + (0.44 × 20) = 14.4 GB
Recommended GPU: RTX 4060 Ti 16GB (dev) or L4 24GB (production)
Cost Analysis:
| Option | Setup | Monthly Cost | Notes |
|---|---|---|---|
| Self-hosted L4 | Cloud | ~$450/mo | 24/7 operation |
| Self-hosted RTX 4060 Ti | On-prem | $499 one-time | Dev/testing only |
| Cloud (spot) | Cloud | ~$200-300/mo | Interruptible |
Example 2: Medium Production (50-100 users)
Scenario: Internal AI assistant for enterprise
Specifications:
- Model: Mistral 7B (Q4_K_M) with vLLM
- Average context: 3K tokens
- Peak concurrent users: 100
- Optimizations: PagedAttention + prefix caching
Calculation:
Base model: 4.5 GB
System overhead: 0.5 GB
Per-user KV (avg 3K): 0.30 GB
Theoretical (100 users): 4.5 + 0.5 + (0.30 × 100) = 35 GB
With PagedAttention efficiency and prefix caching:
Actual requirement: ~20-25 GB
Architecture:
Recommended GPU: A10G 24GB or RTX 4090 24GB (development)
Example 3: Enterprise Scale (200+ users)
Scenario: High-traffic SaaS platform
Specifications:
- Model: Llama 3 70B (AWQ 4-bit)
- Average context: 6K tokens
- Peak concurrent users: 200-500
- Setup: Multi-GPU with tensor parallelism
Calculation:
Model weights: 40 GB (AWQ 4-bit)
System overhead: 1 GB
Tensor parallelism: Distributed across 2 GPUs
Per-GPU utilization: ~35 GB (model) + KV cache
Total system: 2× A100 80GB or 2× H100 80GB
Deployment: vLLM with continuous batching and tensor parallelism
Performance:
Expected Throughput:
- TTFT (Time to First Token): 200-400 ms
- Tokens/second: 130-250 (depending on batch size)
- Concurrent requests: 200-500 with dynamic batching
Infrastructure:
- 2× A100 80GB: ~$3,000-6,000/month (cloud)
- 2× H100 80GB: ~$12,000-20,000/month (cloud)
- On-premises: ~$500-1,000/month power
Production GPU Recommendations
Datacenter GPUs: Cloud vs. Market Pricing
| Use Case | GPU | Cloud Price | Market Price | Rationale |
|---|---|---|---|---|
| Development/Testing | RTX 4090 (24GB) | N/A | $1,600-2,000 | Exceptional for developers and small teams, handles 7B-13B models comfortably for prototyping |
| Development/Testing | RTX 5090 (32GB) | N/A | $2,000-2,500 | Latest with 32GB VRAM, 30% faster than 4090 for transformers |
| Small Production | L4 (24GB) | $0.60-1.00/hr | $4,000-6,000 | Power efficient (72W), compact, handles 7B-13B models for chatbots |
| Small Production | RTX A4000 (16GB) | $0.40-0.70/hr | $1,500-2,000 | Enterprise-rated workstation GPU with ECC memory |
| Small Production | A10G (24GB) | $1.00-1.50/hr | $5,000-7,000 | Cost-efficient for deployments, widely available in cloud |
| Standard Production | A100 40GB | $1.50-2.50/hr | $10,000-12,000 | Proven, efficient, still powers production at scale |
| Standard Production | A100 80GB | $2.50-4.00/hr | $15,000-18,000 | Double memory for larger models or more concurrent users |
| High-Performance LLM | H100 80GB (PCIe) | $4.00-6.00/hr | $25,000-30,000 | 2-3x faster than A100 for LLM workloads with FP8 support |
| High-Performance LLM | H100 80GB (SXM) | $6.00-10.00/hr | $30,000-35,000 | NVLink for multi-GPU, best for transformer-heavy models |
| Multi-Model/Long-Context | H200 (141GB) | $8.00-12.00/hr | $40,000-55,000 | 76% more memory and 43% higher bandwidth than H100 |
| Frontier-Scale | B200 (192GB) | $12.00-18.00/hr | $30,000-35,000 | Availability severely constrained with 3-6 month lead times |
Cloud prices vary by provider (AWS, GCP, Azure, RunPod, Lambda Labs). Market prices are approximate as of November 2025 and fluctuate based on supply/demand.
Monthly Cost Examples (24/7 Operation)
Small Scale:
Target: 10-20 concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 1× L4 (cloud) | $450-750 | $5,400-9,000 |
| 1× RTX 4090 (on-prem) | $50-100* | $2,100** |
*Power costs only
**Includes hardware cost
Medium Scale:
Target: 50-100 concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 1× A10G (cloud) | $750-1,100 | $9,000-13,200 |
| 1× A100 40GB (cloud) | $1,100-1,800 | $13,200-21,600 |
| 1× A100 80GB (on-prem) | $400-600* | $19,800** |
*Power costs only
**Includes hardware cost ($15,000)
Enterprise Scale:
Target: 200+ concurrent users
| Configuration | Monthly Cost | TCO (1 year) |
|---|---|---|
| 2× A100 80GB (cloud) | $3,600-5,800 | $43,200-69,600 |
| 2× H100 80GB (cloud) | $8,600-14,400 | $103,200-172,800 |
| 2× H100 80GB (on-prem) | $1,000-1,500* | $78,000** |
*Power costs only
**Includes hardware cost ($60,000)
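The TCO figures in these tables follow from simple arithmetic: cloud TCO is the monthly bill times twelve, and on-prem TCO is hardware cost plus a year of power. A quick sketch reproducing the 2× H100 rows:
```python
def cloud_tco_1yr(monthly_cost: float) -> float:
    return monthly_cost * 12

def onprem_tco_1yr(hardware_cost: float, monthly_power: float) -> float:
    return hardware_cost + monthly_power * 12

# 2x H100 80GB rows from the enterprise table above
print(cloud_tco_1yr(8_600), cloud_tco_1yr(14_400))  # 103,200 - 172,800
print(onprem_tco_1yr(60_000, 1_500))                # 78,000
```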
⚠️ Consumer GPU Warning for Production
Consumer GPUs (RTX 4090/5090) are NOT recommended for production deployments despite attractive pricing. Here's why:
| Feature | Consumer GPUs | Datacenter GPUs |
|---|---|---|
| ECC Memory | ❌ No (data corruption risk) | ✅ Yes (data integrity) |
| 24/7 Operation | ❌ Not rated | ✅ Designed for continuous use |
| Warranty | 1-3 years, consumer support | 3-5 years, enterprise SLAs |
| Multi-GPU Scaling | Limited/No NVLink | ✅ NVLink, NVSwitch |
| Power Management | Gaming-focused | Datacenter-optimized |
| MIG Support | ❌ No | ✅ Yes (resource partitioning) |
| Thermal Design | Intermittent loads | Continuous operation |
Acceptable Consumer GPU Use Cases:
- Development and testing environments
- Proof-of-concept deployments
- Personal projects and learning
- Budget-constrained startups (with awareness of risks)
Cost Optimization Strategies
1. Cloud vs. On-Premises Decision Matrix
2. Hybrid Deployment Strategy
Tier your workloads:
```python
# Example routing logic; analyze_complexity is assumed to return a 0-1 score
def route_request(request):
    complexity_score = analyze_complexity(request)
    if complexity_score < 0.3:
        # ~70% of queries - use small model
        return "llama-3-8b-endpoint"    # A10G GPU
    elif complexity_score < 0.7:
        # ~20% of queries - use medium model
        return "mixtral-8x7b-endpoint"  # A100 80GB
    else:
        # ~10% of queries - use large model
        return "llama-3-70b-endpoint"   # 2x H100 80GB
```
Cost Impact:
- Small model tier (≈70% of queries): $0.70/hr
- Medium model tier (≈20% of queries): $0.60/hr
- Large model tier (≈10% of queries): $1.20/hr
- Blended total: $2.50/hr vs. $12.00/hr (all queries on the large model)
- Savings: ~79%
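The savings figure follows directly from the tier costs above: roughly $2.50/hr blended versus $12.00/hr if every query hit the large-model endpoint. A quick check:
```python
# Blended hourly cost of the tiered setup vs. routing everything to the large model
tier_cost_per_hr = {
    "small (70% of queries)": 0.70,
    "medium (20% of queries)": 0.60,
    "large (10% of queries)": 1.20,
}

blended = sum(tier_cost_per_hr.values())  # ~$2.50/hr
all_large = 12.00                         # $/hr if every query used the 2x H100 endpoint
savings = 1 - blended / all_large

print(f"${blended:.2f}/hr vs ${all_large:.2f}/hr -> {savings:.0%} savings")  # ~79%
```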
3. Reserved Instances & Commitments
| Provider | Spot/Preemptible | 1-Year Reserved | 3-Year Reserved |
|---|---|---|---|
| AWS | 50-70% off | 30-40% off | 50-60% off |
| GCP | 60-91% off | 37% off | 55% off |
| Azure | 60-80% off | 30-50% off | 50-70% off |
Start with on-demand or spot instances for testing. Commit to reserved instances once usage patterns are established (typically after 2-3 months).
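As a rough planning aid, the discount tiers translate into monthly costs as sketched below. The $2.00/hr on-demand rate is an illustrative A100-class figure, and the discounts are approximate midpoints of the ranges in the table above.
```python
# Effective monthly cost under different commitment levels (illustrative numbers).
on_demand_hourly = 2.00   # $/hr, hypothetical A100-class on-demand rate
hours_per_month = 730
discounts = {"on-demand": 0.0, "spot": 0.65, "1-year reserved": 0.35, "3-year reserved": 0.55}

for plan, discount in discounts.items():
    monthly = on_demand_hourly * (1 - discount) * hours_per_month
    print(f"{plan:>16}: ${monthly:,.0f}/mo")
```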
4. Used Datacenter GPU Market
| GPU | New Price | Used Price | Considerations |
|---|---|---|---|
| Tesla V100 32GB | $8,000 | $2,000-3,500 | Older but reliable, good for smaller models |
| A40 48GB | $10,000 | $5,000-7,000 | Excellent for inference, good VRAM |
| A100 40GB | $12,000 | $7,000-9,000 | Still very capable for most workloads |
| A100 80GB | $18,000 | $12,000-15,000 | Best value in used market |
When buying used hardware:
- Verify warranty transfer and remaining coverage
- Test thoroughly before deployment
- Check for mining history (can degrade GPUs)
- Ensure compatible firmware/drivers
Infrastructure Best Practices
1. Inference Framework Selection
vLLM
Best for: Production throughput, concurrent users
```bash
# Installation
pip install vllm

# Launch server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --dtype float16
```
Pros:
- ✅ PagedAttention for memory efficiency
- ✅ Continuous batching
- ✅ Highest throughput
- ✅ OpenAI-compatible API
Cons:
- ❌ More complex setup
- ❌ GPU-only (CUDA required)
Text Generation Inference
Best for: Production with HuggingFace models
```bash
# Docker deployment
docker run --gpus all \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-8B-Instruct
```
Pros:
- ✅ Easy HuggingFace integration
- ✅ Good performance
- ✅ Production-ready
Cons:
- ❌ Slightly lower throughput than vLLM
- ❌ Less flexible memory management
llama.cpp
Best for: CPU/Apple Silicon, edge deployment
```bash
# Installation
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run server
./server -m models/llama-3-8b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080
```
Pros:
- ✅ CPU support (no GPU required)
- ✅ Excellent Apple Silicon support
- ✅ Wide quantization format support (GGUF)
- ✅ Low memory overhead
Cons:
- ❌ Lower throughput than GPU solutions
- ❌ Not optimized for high concurrency
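Because vLLM exposes an OpenAI-compatible API, clients can use the standard openai Python package. A minimal sketch, assuming the vLLM server command above is running locally on its default port 8000 (host, port, and API key handling are deployment-specific):
```python
# Querying a local vLLM OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default port
    api_key="EMPTY",                      # any value works unless the server enforces a key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```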
2. Monitoring and Observability
Key metrics to track:
```python
# Example monitoring setup (Prometheus + Grafana)
metrics = {
    # GPU Metrics
    "gpu_memory_used": "GPU memory utilization (%)",
    "gpu_utilization": "GPU compute utilization (%)",
    "gpu_temperature": "GPU temperature (°C)",
    # Inference Metrics
    "ttft": "Time to first token (ms)",
    "tokens_per_second": "Generation speed",
    "requests_per_second": "Throughput",
    "queue_length": "Pending requests",
    # KV Cache Metrics
    "kv_cache_usage": "KV cache memory (GB)",
    "kv_cache_hit_rate": "Prefix cache hit rate (%)",
    "active_sequences": "Concurrent active sequences",
    # Business Metrics
    "p50_latency": "Median latency (ms)",
    "p99_latency": "99th percentile latency (ms)",
    "error_rate": "Failed requests (%)",
}
```
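A hedged sketch of how a few of the GPU metrics above could be exported for Prometheus to scrape, using prometheus_client and NVML via pynvml. The metric names mirror the table; the 15-second interval and port 9400 are arbitrary choices.
```python
# Minimal GPU-metrics exporter sketch (prometheus_client + pynvml).
import time

import pynvml
from prometheus_client import Gauge, start_http_server

gpu_memory_used = Gauge("gpu_memory_used", "GPU memory utilization (%)")
gpu_utilization = Gauge("gpu_utilization", "GPU compute utilization (%)")
gpu_temperature = Gauge("gpu_temperature", "GPU temperature (C)")

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

start_http_server(9400)  # Prometheus scrapes http://<host>:9400/metrics
while True:
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

    gpu_memory_used.set(100 * mem.used / mem.total)
    gpu_utilization.set(util.gpu)
    gpu_temperature.set(temp)
    time.sleep(15)
```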
3. Scaling Strategies
Key Takeaways
- Context length dominates memory: Long conversations can consume more VRAM than the model weights themselves
- Use modern inference engines: vLLM with PagedAttention achieves up to 24x higher throughput by reducing memory waste
- Actual concurrency differs from theoretical: Prefix caching and average context lengths allow more users than worst-case math suggests
- Production requires datacenter GPUs: Consumer cards lack reliability, ECC memory, and support needed for 24/7 operation
- Test before buying: Real-world capacity typically exceeds theoretical calculations
Decision Framework
Resource Checklist
Before deploying to production:
- Load test with realistic traffic patterns (not just synthetic benchmarks)
- Measure actual KV cache usage with your specific use case
- Test failover and redundancy (what happens when a GPU fails?)
- Monitor costs daily for the first month
- Set up alerting for GPU memory, temperature, and latency
- Document your deployment (infrastructure as code)
- Plan for model updates (how will you deploy new versions?)
- Implement rate limiting to prevent abuse
- Set up logging for debugging and compliance
- Test disaster recovery procedures
Additional Resources
- vLLM Documentation: https://docs.vllm.ai/
- NVIDIA GPU Selector: https://www.nvidia.com/en-us/data-center/gpu-selector/
- LLM VRAM Calculator: https://apxml.com/tools/vram-calculator
- Cloud GPU Pricing Comparison: https://cloud-gpus.com/
Always validate with real workloads before finalizing hardware decisions. Cloud deployments allow testing without capital expense, making them ideal for determining actual requirements before committing to on-premises infrastructure.
- Start with a small cloud deployment (1ร L4 or A10G)
- Run your actual workload for 1-2 weeks
- Analyze metrics (GPU utilization, latency, throughput)
- Scale up or down based on real data
- Consider on-premises only after 3+ months of stable usage patterns