
VRAM Requirements for Self-Hosted LLM Production Deployments

· 13 min read
DevOps & Infrastructure Team

Running production LLM inference requires careful VRAM planning. Total VRAM requirements consist of model weights, KV cache that grows with concurrent users and context length, and system overhead.

Key Insight

Modern inference engines like vLLM with PagedAttention can achieve 24x higher throughput than basic implementations by reducing memory waste from 60-80% to under 4%.

VRAM Estimation Formula

The formula for estimating total VRAM requirements in production:

$$V_{total} = V_{model} + V_{overhead} + (V_{KV\_per\_request} \times concurrent\_requests)$$

Components Explained

| Component | Variable | Description |
| --- | --- | --- |
| Model Weights | $V_{model}$ | Fixed VRAM for quantized model parameters (depends on model size and quantization level) |
| System Overhead | $V_{overhead}$ | Inference framework, CUDA, and OS overhead (~0.5-1 GB) |
| KV Cache per Request | $V_{KV\_per\_request}$ | Memory per active user session (grows with context length) |
| Concurrent Requests | N/A | Number of simultaneous inference sessions |
KV Cache Scaling

For Llama 3 8B at full 8K context, the KV cache requires approximately 1.1 GB per sequence. This grows linearly with context length and batch size.
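
For reference, this per-sequence figure can be reproduced from the model's published architecture (32 transformer layers, 8 KV heads of dimension 128, 16-bit cache values); these numbers come from the public Llama 3 8B configuration rather than from this article:

```python
# Where the ~1.1 GB per-sequence figure comes from (Llama 3 8B architecture values)
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                                              # FP16/BF16 KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V: 131,072 bytes/token
per_sequence_gb = per_token * 8192 / 1e9                        # full 8K context
print(per_sequence_gb)                                          # ~1.07 GB, i.e. the ~1.1 GB above
```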

Model VRAM Requirements

| Model | Parameters | $V_{model}$ (Q4) | $V_{KV}$ per 1K tokens | Base VRAM |
| --- | --- | --- | --- | --- |
| Mistral 7B | 7B | 4.5 GB | ~0.10 GB | ~5 GB |
| Llama 3 8B | 8B | 5.1 GB | ~0.11 GB | ~6 GB |
| Gemma 2 9B | 9B | 5.8 GB | ~0.12 GB | ~7 GB |

KV Cache Scaling

Each 1,000 tokens consumes approximately 0.11 GB of additional VRAM for 7-8B models. Long conversations can quickly exceed model weight memory usage.
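
As a rough planning aid, the estimation formula and the table above can be wrapped in a small helper. The function name and the 0.5 GB default overhead are illustrative choices, and the per-model numbers are the approximate Q4 figures from the table, not measured values:

```python
# Sketch of the V_total estimation formula using the approximate table values above
MODELS = {
    # name:        (V_model in GB, V_KV per 1K tokens in GB), Q4 quantization
    "mistral-7b":  (4.5, 0.10),
    "llama-3-8b":  (5.1, 0.11),
    "gemma-2-9b":  (5.8, 0.12),
}

def estimate_total_vram(model, avg_context_tokens, concurrent_requests, overhead_gb=0.5):
    """V_total = V_model + V_overhead + (V_KV_per_request x concurrent_requests)."""
    v_model, kv_per_1k = MODELS[model]
    v_kv_per_request = kv_per_1k * (avg_context_tokens / 1000)
    return v_model + overhead_gb + v_kv_per_request * concurrent_requests

# Example: Mistral 7B, 3K average context, 10 concurrent requests
print(estimate_total_vram("mistral-7b", 3000, 10))   # ~8 GB
```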

Critical Production Considerations

1. KV Cache Reality Check

For an A10 GPU with 24GB VRAM serving Llama 2 7B:

  • Model weights: 14 GB (2 bytes per parameter × 7B)
  • Available for KV cache: 10 GB
  • Total capacity: ~20K tokens (including prompts)

This severely limits concurrent users with long contexts.

```python
# Example calculation for A10 GPU
model_vram = 14   # GB for Llama 2 7B in FP16
gpu_vram = 24     # GB total
overhead = 0.5    # GB

available_kv = gpu_vram - model_vram - overhead   # 9.5 GB
tokens_per_gb = 2048                              # approximate for 7B models
max_tokens = available_kv * tokens_per_gb         # ~19,456 tokens

# With 8K context per user:
max_concurrent_users = max_tokens / 8192          # ~2.4 users
```
Production Insight

Most concurrent requests don't use maximum context length. With an average context window of 4K tokens, you can support significantly more users than worst-case calculations suggest.
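
Rerunning the tail of the calculation above with a 4K average context instead of the 8K worst case illustrates the point:

```python
# Same A10 budget as above, but with the 4K average context from the note
max_tokens = 19456    # available KV-cache capacity computed above
avg_context = 4096    # assumed average tokens per active session
print(max_tokens / avg_context)   # ~4.75 concurrent users vs. ~2.4 in the worst case
```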

2. Modern Optimization: PagedAttention

vLLM with PagedAttention revolutionizes memory efficiency:

  • Traditional systems: 60-80% memory waste
  • vLLM with PagedAttention: Less than 4% memory waste
  • Performance gain: Up to 24x higher throughput

Key Features:

  1. Paged Memory Management: Breaks KV cache into fixed-size blocks that can be stored non-contiguously
  2. Prefix Caching: When multiple requests share the same prompt (e.g., system prompts), stores only one copy
  3. Copy-on-Write: Creates new blocks only when sequences diverge
  4. Continuous Batching: Dynamically adds/removes requests at iteration level
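
In vLLM these behaviors are mostly configuration-level. Below is a minimal sketch of turning on prefix caching and bounding memory use through the Python API; argument names reflect recent vLLM releases, so verify them against your installed version:

```python
# Sketch: prefix caching and memory headroom via vLLM's offline Python API
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    enable_prefix_caching=True,    # reuse KV blocks for identical prompt prefixes (e.g. system prompts)
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim for weights + KV cache
    max_model_len=8192,            # cap context length to bound per-sequence KV cache
)

outputs = llm.generate(
    ["Summarize the VRAM planning formula in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Continuous batching is handled automatically by the engine; the options above only control how the KV cache is shared and sized.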

Realistic Deployment Examples

Example 1: Small Production (10-20 users)

Scenario: Customer support chatbot with moderate traffic

  • Model: Llama 3 8B (Q4_K_M quantization)
  • Average context: 4K tokens per user
  • Peak concurrent users: 20
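
A quick back-of-the-envelope check with the estimation formula, assuming ~0.5 GB overhead and the Q4 figures from the table above:

```python
# Example 1 estimate using the table values above
v_model = 5.1                  # GB, Llama 3 8B Q4
overhead = 0.5                 # GB
kv_per_request = 0.11 * 4      # GB, ~0.11 GB per 1K tokens x 4K average context
total = v_model + overhead + kv_per_request * 20   # 20 peak concurrent users
print(total)   # ~14.4 GB -> fits a single 24 GB GPU (e.g. L4 or A10G) with headroom
```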

Example 2: Medium Production (50-100 users)

Scenario: Internal AI assistant for enterprise

  • Model: Mistral 7B (Q4_K_M) with vLLM
  • Average context: 3K tokens
  • Peak concurrent users: 100
  • Optimizations: PagedAttention + prefix caching
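
The same worst-case arithmetic, before any prompt sharing, looks like this; in practice prefix caching on shared system prompts and sub-peak average load bring the real footprint down:

```python
# Example 2 worst-case estimate (no prefix caching) using the table values above
v_model = 4.5                  # GB, Mistral 7B Q4
overhead = 0.5                 # GB
kv_per_request = 0.10 * 3      # GB, ~0.10 GB per 1K tokens x 3K average context
total = v_model + overhead + kv_per_request * 100  # 100 peak concurrent users
print(total)   # ~35 GB before optimizations; prefix caching shrinks this substantially
```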

Example 3: Enterprise Scale (200+ users)

Scenario: High-traffic SaaS platform

  • Model: Llama 3 70B (AWQ 4-bit)
  • Average context: 6K tokens
  • Peak concurrent users: 200-500
  • Setup: Multi-GPU with tensor parallelism
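
A rough weight-only estimate explains the multi-GPU requirement; the 0.5 bytes-per-parameter figure for 4-bit AWQ is an approximation that ignores quantization scales and other overhead:

```python
# Example 3: weights alone at 4-bit quantization (approximation)
params = 70e9
weights_gb = params * 0.5 / 1e9   # ~35 GB before any KV cache
print(weights_gb)
# Add the KV cache for hundreds of concurrent 6K-token sequences and a single GPU
# is no longer enough, hence the multi-GPU tensor-parallel setup.
```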

Production GPU Recommendations

Datacenter GPUs: Cloud vs. Market Pricing

| Use Case | GPU | Cloud Price | Market Price | Rationale |
| --- | --- | --- | --- | --- |
| Development/Testing | RTX 4090 (24GB) | N/A | $1,600-2,000 | Exceptional for developers and small teams, handles 7B-13B models comfortably for prototyping |
| Development/Testing | RTX 5090 (32GB) | N/A | $2,000-2,500 | Latest with 32GB VRAM, 30% faster than 4090 for transformers |
| Small Production | L4 (24GB) | $0.60-1.00/hr | $4,000-6,000 | Power efficient (72W), compact, handles 7B-13B models for chatbots |
| Small Production | RTX A4000 (16GB) | $0.40-0.70/hr | $1,500-2,000 | Enterprise-rated workstation GPU with ECC memory |
| Small Production | A10G (24GB) | $1.00-1.50/hr | $5,000-7,000 | Cost-efficient for deployments, widely available in cloud |
| Standard Production | A100 40GB | $1.50-2.50/hr | $10,000-12,000 | Proven, efficient, still powers production at scale |
| Standard Production | A100 80GB | $2.50-4.00/hr | $15,000-18,000 | Double memory for larger models or more concurrent users |
| High-Performance LLM | H100 80GB (PCIe) | $4.00-6.00/hr | $25,000-30,000 | 2-3x faster than A100 for LLM workloads with FP8 support |
| High-Performance LLM | H100 80GB (SXM) | $6.00-10.00/hr | $30,000-35,000 | NVLink for multi-GPU, best for transformer-heavy models |
| Multi-Model/Long-Context | H200 (141GB) | $8.00-12.00/hr | $40,000-55,000 | 76% more memory and 43% higher bandwidth than H100 |
| Frontier-Scale | B200 (192GB) | $12.00-18.00/hr | $30,000-35,000 | Availability severely constrained with 3-6 month lead times |
Price Variability

Cloud prices vary by provider (AWS, GCP, Azure, RunPod, Lambda Labs). Market prices are approximate as of November 2025 and fluctuate based on supply/demand.

Monthly Cost Examples (24/7 Operation)

Target: 10-20 concurrent users

| Configuration | Monthly Cost | TCO (1 year) |
| --- | --- | --- |
| 1× L4 (cloud) | $450-750 | $5,400-9,000 |
| 1× RTX 4090 (on-prem) | $50-100* | $2,100** |

*Power costs only
**Includes hardware cost
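
The cloud figure follows directly from the L4 hourly range in the GPU table above (roughly 730 hours in a month):

```python
# Sanity check on the L4 cloud figure using the hourly range from the GPU table
hours_per_month = 730
print(0.60 * hours_per_month, 1.00 * hours_per_month)   # ~$438-730/month; x12 gives the 1-year TCO range
```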

⚠️ Consumer GPU Warning for Production

Do Not Use Consumer GPUs in Production

Consumer GPUs (RTX 4090/5090) are NOT recommended for production deployments despite attractive pricing. Here's why:

| Feature | Consumer GPUs | Datacenter GPUs |
| --- | --- | --- |
| ECC Memory | ❌ No (data corruption risk) | ✅ Yes (data integrity) |
| 24/7 Operation | ❌ Not rated | ✅ Designed for continuous use |
| Warranty | 1-3 years, consumer support | 3-5 years, enterprise SLAs |
| Multi-GPU Scaling | Limited/No NVLink | ✅ NVLink, NVSwitch |
| Power Management | Gaming-focused | Datacenter-optimized |
| MIG Support | ❌ No | ✅ Yes (resource partitioning) |
| Thermal Design | Intermittent loads | Continuous operation |

Acceptable Consumer GPU Use Cases:

  • Development and testing environments
  • Proof-of-concept deployments
  • Personal projects and learning
  • Budget-constrained startups (with awareness of risks)

Cost Optimization Strategies

1. Cloud vs. On-Premises Decision Matrix

2. Hybrid Deployment Strategy

Tier your workloads:

```python
# Example routing logic
def route_request(request):
    # analyze_complexity() is assumed to return a score in [0, 1],
    # e.g. from a lightweight classifier or heuristic
    complexity_score = analyze_complexity(request)

    if complexity_score < 0.3:
        # 70% of queries - use small model
        return "llama-3-8b-endpoint"     # A10G GPU
    elif complexity_score < 0.7:
        # 20% of queries - use medium model
        return "mixtral-8x7b-endpoint"   # A100 80GB
    else:
        # 10% of queries - use large model
        return "llama-3-70b-endpoint"    # 2x H100 80GB
```

Cost Impact:

  • Small model: $1.00/hr × 0.70 = $0.70/hr
  • Medium model: $3.00/hr × 0.20 = $0.60/hr
  • Large model: $12.00/hr × 0.10 = $1.20/hr
  • Total: $2.50/hr vs. $12.00/hr (all queries on large model)
  • Savings: 79%

3. Reserved Instances & Commitments

| Provider | Spot/Preemptible | 1-Year Reserved | 3-Year Reserved |
| --- | --- | --- | --- |
| AWS | 50-70% off | 30-40% off | 50-60% off |
| GCP | 60-91% off | 37% off | 55% off |
| Azure | 60-80% off | 30-50% off | 50-70% off |
Recommendation

Start with on-demand or spot instances for testing. Commit to reserved instances once usage patterns are established (typically after 2-3 months).

4. Used Datacenter GPU Market

| GPU | New Price | Used Price | Considerations |
| --- | --- | --- | --- |
| Tesla V100 32GB | $8,000 | $2,000-3,500 | Older but reliable, good for smaller models |
| A40 48GB | $10,000 | $5,000-7,000 | Excellent for inference, good VRAM |
| A100 40GB | $12,000 | $7,000-9,000 | Still very capable for most workloads |
| A100 80GB | $18,000 | $12,000-15,000 | Best value in used market |
Used GPU Caution
  • Verify warranty transfer and remaining coverage
  • Test thoroughly before deployment
  • Check for mining history (can degrade GPUs)
  • Ensure compatible firmware/drivers

Infrastructure Best Practices

1. Inference Framework Selection

vLLM: best for production throughput and many concurrent users.

```bash
# Installation
pip install vllm

# Launch server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --dtype float16
```

Pros:

  • ✅ PagedAttention for memory efficiency
  • ✅ Continuous batching
  • ✅ Highest throughput
  • ✅ OpenAI-compatible API

Cons:

  • โŒ More complex setup
  • โŒ GPU-only (CUDA required)

2. Monitoring and Observability

Key metrics to track:

```python
# Example monitoring setup (Prometheus + Grafana)
metrics = {
    # GPU Metrics
    "gpu_memory_used": "GPU memory utilization (%)",
    "gpu_utilization": "GPU compute utilization (%)",
    "gpu_temperature": "GPU temperature (°C)",

    # Inference Metrics
    "ttft": "Time to first token (ms)",
    "tokens_per_second": "Generation speed",
    "requests_per_second": "Throughput",
    "queue_length": "Pending requests",

    # KV Cache Metrics
    "kv_cache_usage": "KV cache memory (GB)",
    "kv_cache_hit_rate": "Prefix cache hit rate (%)",
    "active_sequences": "Concurrent active sequences",

    # Business Metrics
    "p50_latency": "Median latency (ms)",
    "p99_latency": "99th percentile latency (ms)",
    "error_rate": "Failed requests (%)",
}
```
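
The GPU-level metrics above can be exported to Prometheus with a small script built on the NVML Python bindings and prometheus_client; this is a minimal sketch assuming those packages are installed and a single GPU at index 0 (the inference- and KV-cache-level metrics come from the serving framework itself, e.g. vLLM's own Prometheus endpoint):

```python
# Minimal GPU metrics exporter sketch (pip install nvidia-ml-py prometheus-client)
import time

import pynvml
from prometheus_client import Gauge, start_http_server

gpu_memory_used = Gauge("gpu_memory_used", "GPU memory used (GB)")
gpu_utilization = Gauge("gpu_utilization", "GPU compute utilization (%)")
gpu_temperature = Gauge("gpu_temperature", "GPU temperature (C)")

def collect(handle):
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    gpu_memory_used.set(mem.used / 1e9)
    gpu_utilization.set(util.gpu)
    gpu_temperature.set(temp)

if __name__ == "__main__":
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    start_http_server(8001)   # arbitrary scrape port for Prometheus
    while True:
        collect(handle)
        time.sleep(15)
```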

3. Scaling Strategies

Key Takeaways

Summary
  1. Context length dominates memory: Long conversations consume more VRAM than model weights
  2. Use modern inference engines: vLLM with PagedAttention achieves 14-24x higher throughput by reducing memory waste
  3. Actual concurrency differs from theoretical: Prefix caching and average context lengths allow more users than worst-case math suggests
  4. Production requires datacenter GPUs: Consumer cards lack reliability, ECC memory, and support needed for 24/7 operation
  5. Test before buying: Real-world capacity typically exceeds theoretical calculations

Decision Framework

Resource Checklist

Before deploying to production:

  • Load test with realistic traffic patterns (not just synthetic benchmarks)
  • Measure actual KV cache usage with your specific use case
  • Test failover and redundancy (what happens when a GPU fails?)
  • Monitor costs daily for the first month
  • Set up alerting for GPU memory, temperature, and latency
  • Document your deployment (infrastructure as code)
  • Plan for model updates (how will you deploy new versions?)
  • Implement rate limiting to prevent abuse
  • Set up logging for debugging and compliance
  • Test disaster recovery procedures

Additional Resources


Always validate with real workloads before finalizing hardware decisions. Cloud deployments allow testing without capital expense, making them ideal for determining actual requirements before committing to on-premises infrastructure.

Next Steps
  1. Start with a small cloud deployment (1× L4 or A10G)
  2. Run your actual workload for 1-2 weeks
  3. Analyze metrics (GPU utilization, latency, throughput)
  4. Scale up or down based on real data
  5. Consider on-premises only after 3+ months of stable usage patterns