Enterprise-grade · Air-gapped · 2-year hardware horizon
Self-Hosted LLMs
for Coding Agents
A complete hardware, model, and infrastructure reference for running state-of-the-art coding agents entirely on-premises.
10
Evaluated Models
Leading coding LLMs
97.5%
Peak HumanEval
Qwen3-235B-A22B
192 GB
Max NVIDIA GPU VRAM
NVIDIA B200 SXM
5
Inference Engines
Production-ready
6
Open-license Models
Apache 2.0 / MIT
2026 Recommendation
Pair NVIDIA H200 or B200 GPUs with vLLM or SGLang, serving Qwen2.5-Coder-32B or DeepSeek-R1 (671B MoE), for maximum agentic coding performance on-premises.
DeepSeek-R1 · vLLM · H200 SXM · NVLink 4.0 · FP8
HumanEval+ Score (%) — Higher is better
Qwen3-235B-A22B · Apache 2.0
Parameters
235B (MoE)
Active Params
22B
Context Window
128K
Min VRAM
48 GB (FP8 MoE)
HumanEval+
97.5%
MBPP+
95.1%
SWE-bench
52.8%
Thinking mode · Tool-calling · Best Apache 2.0 coding model · Multilingual
Capability Profile — Qwen3-235B-A22B
Supported quantization formats
FP16 · BF16 · AWQ · GGUF · FP8
GPU Hardware
Specifications from official product pages · NVIDIA Data Center · AMD Instinct
Memory Bandwidth (GB/s) — Higher is better
NVIDIA
AMD
NVIDIA H200 SXM
Hopper
VRAM
141 GB
Bandwidth
5 TB/s
BF16
989 TFLOPS
TDP
700 W
Large model inference, 70B–405B models
NVLink · ↗ Official Specs
NVIDIA B200 SXM
Blackwell
VRAM
192 GB
Bandwidth
8 TB/s
BF16
2 PFLOPS
TDP
1 kW
Frontier model inference, 405B–671B models
NVLink · ↗ Official Specs
NVIDIA GB200 NVL72
Blackwell
VRAM
13.5 TB
Bandwidth
576 TB/s
FP4
1,440 PFLOPS
TDP
120 kW
Full cluster inference, trillion-param models
NVLink · ↗ Official Specs
AMD MI325X
CDNA 3
VRAM
256 GB
Bandwidth
6 TB/s
BF16
1 PFLOPS
TDP
750 W
Large model inference, 70B–405B alternative
AMD MI350X
CDNA 4
VRAM
288 GB
Bandwidth
8 TB/s
BF16
2 PFLOPS
TDP
800 W
Next-gen inference, competes with B200
NVIDIA H100 SXM5
Hopper
VRAM
80 GB
Bandwidth
3 TB/s
BF16
989 TFLOPS
TDP
700 W
7B–70B models, cost-effective workhorse
NVLink · ↗ Official Specs
Rack-Scale System
NVIDIA GB200 NVL72
72 GPUs
B200 per rack
13.5 TB
Total HBM3e
1.44 ExaFLOPS
FP4 inference (rack)
576 TB/s
Aggregate HBM3e bandwidth
120 kW
Rack TDP
Designed to run 671B+ MoE frontier models at full FP8 precision. The entire rack is connected over NVLink 5.0, so tensor-parallel traffic stays on the NVLink fabric instead of crossing slower networking.
↗ NVIDIA GB200 NVL72 — Official Product Page
Inference Frameworks
Production serving engines — ranked by peak throughput capability
vLLM
vLLM Project (UC Berkeley) · Python · Apache 2.0
Production-grade inference engine with continuous batching and paged KV-cache management.
Best for: Enterprise REST API serving, high-concurrency workloads
⚡ PagedAttention · Multi-GPU · Speculative Decoding · OpenAI-compatible
Relative Throughput
High
Quantization Support
FP8 · AWQ · GPTQ · INT8
TensorRT-LLM
NVIDIA · C++/Python · Apache 2.0
Maximum throughput on NVIDIA hardware via custom CUDA kernels and TensorRT graph optimization.
Best for: NVIDIA-only infra, maximum tokens/sec
⚡ NVIDIA-optimized kernels · Multi-GPU · Speculative Decoding · Triton Inference Server
Relative Throughput
Highest (NVIDIA)
Quantization Support
FP8 · INT4 · INT8 · AWQ · GPTQ
SGLang
LMSYS / Stanford · Python · Apache 2.0
Structured generation language with RadixAttention for shared prefix caching and fast JSON mode.
Best for: Agentic workloads, tool-calling, structured outputs
⚡ RadixAttention · Multi-GPU · Speculative Decoding · OpenAI-compatible
Relative Throughput
Very High
Quantization Support
FP8 · AWQ · GPTQ · INT8
llama.cpp
Georgi Gerganov (Community) · C++ · MIT
Highly portable inference for CPU and GPU. Runs on virtually any hardware with GGUF quantization.
Best for: Air-gapped / edge deployments, low-VRAM servers
⚡ CPU+GPU hybrid · Multi-GPU · Speculative Decoding · OpenAI-compatible (server mode)
Relative Throughput
Moderate
Quantization Support
GGUF (Q2–Q8) · IQ (imatrix)
Ollama
Ollama Inc. · Go/C++ · MIT
Developer-friendly wrapper around llama.cpp for instant local model deployment.
Best for: Developer workstations, internal tooling prototypes
⚡ One-command deploy · OpenAI-compatible
Relative Throughput
Moderate
Quantization Support
GGUF all variants
Quick Comparison
| Framework | Multi-GPU | Spec. Decode | API Compat | Best For |
|---|---|---|---|---|
| vLLM | ✓ | ✓ | OpenAI-compatible | Enterprise REST API serving |
| TensorRT-LLM | ✓ | ✓ | Triton Inference Server | NVIDIA-only infra |
| SGLang | ✓ | ✓ | OpenAI-compatible | Agentic workloads |
| llama.cpp | ✓ | ✓ | OpenAI-compatible (server mode) | Air-gapped / edge deployments |
| Ollama | — | — | OpenAI-compatible | Developer workstations |
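Every engine in the table except TensorRT-LLM's native Triton path speaks the same OpenAI-compatible HTTP API, so one client works against vLLM, SGLang, llama.cpp server mode, or Ollama unchanged. A minimal standard-library sketch; the base URL and model name are deployment-specific placeholders, not values mandated by any engine:

```python
import json
from urllib import request


def build_payload(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep code generation near-deterministic
        "max_tokens": max_tokens,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Point `base_url` at your serving endpoint, e.g. `http://localhost:8000` for a default `vllm serve` instance; the model name must match whatever identifier the server registered at startup.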
Infrastructure & Deployment
GPU requirements per model size · Based on vLLM Distributed Serving Docs
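The FP16/FP8 VRAM figures for the dense tiers below follow a weights-only rule of thumb: parameter count times bytes per parameter (the MoE entries are quoted differently and do not follow this rule). A sketch for first-pass capacity planning only; real deployments need additional headroom for KV cache, activations, and framework overhead:

```python
def weights_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only VRAM in GB: parameters (billions) x bytes per parameter.

    FP16/BF16 = 2 bytes, FP8/INT8 = 1, INT4 = 0.5. KV cache, activations,
    and serving-engine overhead come on top (plan roughly +20% or more).
    """
    return params_b * bytes_per_param


# Dense-model figures matching the tiers below:
assert weights_vram_gb(7, 2) == 14     # Qwen2.5-Coder-7B, FP16
assert weights_vram_gb(32, 2) == 64    # Qwen2.5-Coder-32B, FP16
assert weights_vram_gb(70, 2) == 140   # Llama 3.3 70B, FP16
assert weights_vram_gb(70, 1) == 70    # Llama 3.3 70B, FP8
```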
7B
params
Qwen2.5-Coder-7B
Developer IDE assistants, lightweight agentic tools
1× H100 80GB · Alt: 1× MI300X
FP16 VRAM
14 GB
FP8 VRAM
7 GB
Throughput
~1,500 tok/s
Latency
<50ms TTFT
Framework
vLLM / Ollama
32B
params
Qwen2.5-Coder-32B
Team-level coding agents, CI/CD review bots
1× H200 SXM · Alt: 2× H100 80GB
FP16 VRAM
64 GB
FP8 VRAM
32 GB
Throughput
~700 tok/s
Latency
<120ms TTFT
Framework
vLLM / SGLang
70B
params
Llama 3.3 70B
Org-wide coding assistant, complex refactoring agents
1× H200 SXM · Alt: 2× H100 80GB / 1× MI325X
FP16 VRAM
140 GB
FP8 VRAM
70 GB
Throughput
~400 tok/s
Latency
<200ms TTFT
Framework
vLLM / TensorRT-LLM
236B MoE
params
DeepSeek-Coder-V2
Enterprise-grade multi-agent coding systems
4× H200 SXM · Alt: 2× B200 SXM / 2× MI325X
FP16 VRAM
320 GB
FP8 VRAM
160 GB
Throughput
~200 tok/s
Latency
<400ms TTFT
Framework
vLLM / SGLang
671B MoE
params
DeepSeek-R1
Frontier reasoning & autonomous coding agents
8× H200 SXM · Alt: 4× B200 / GB200 NVL rack
FP16 VRAM
720 GB
FP8 VRAM
360 GB
Throughput
~80 tok/s
Latency
<800ms TTFT
Framework
vLLM / SGLang
Recommended Enterprise Stack
Orchestration
Kubernetes + KServe / Ray Serve
Serving API
vLLM / SGLang (OpenAI-compatible)
Optimization
TensorRT-LLM (NVIDIA) / ROCm (AMD)
GPU Interconnect
NVLink 4/5 + InfiniBand NDR 400 Gb/s
Storage
NVMe-oF / Local NVMe RAID (>7 GB/s)
Observability
Prometheus + Grafana + DCGM
Enterprise Considerations
Security, compliance, and operational requirements for production LLM deployment
🔒 Data Privacy
Critical
- All inference stays on-premises — no data leaves your network
- Air-gapped deployment supported with llama.cpp or vLLM
- Encrypted model weights at rest (LUKS / BitLocker)
- TLS termination for all API endpoints
✅ Compliance
High
- SOC 2 Type II: audit logging, access control, change management
- HIPAA: PHI never sent to external LLM APIs
- GDPR: no personal data processed by third-party model providers
- FedRAMP: on-prem deployment meets data residency requirements
⚡ High Availability
High
- Multi-node inference with load balancing (Kubernetes + KServe)
- N+1 GPU redundancy for critical workloads
- Rolling model updates with zero-downtime hot-swap
- Health checks and auto-restart via systemd or Kubernetes liveness probes
🌐 Networking
High
- InfiniBand HDR/NDR (400 Gb/s) for multi-GPU tensor parallelism
- NVLink 4.0 for within-node GPU interconnect (900 GB/s)
- 25 GbE minimum for client-to-inference-server traffic
- RDMA over Converged Ethernet (RoCE) as an InfiniBand alternative
💾 Storage
Medium
- NVMe SSDs (>7 GB/s sequential) for fast model loading (<60 s for 70B)
- Minimum 4 TB raw storage per inference node for multiple model versions
- Shared NFS / NVMe-oF for a centralized model repository
- Snapshots and versioning for model weight management
📊 Monitoring & Observability
Medium
- Prometheus + Grafana for GPU utilization, throughput, and latency metrics
- DCGM (Data Center GPU Manager) for NVIDIA GPU health
- OpenTelemetry tracing for per-request latency breakdown
- Alerting on GPU memory pressure, thermal throttling, and error rates
Total Cost of Ownership — Key Factors
CapEx: Hardware
GPU servers, networking, storage
OpEx: Power
H100 ~700W, B200 ~1kW per GPU
OpEx: Cooling
Data center PUE 1.1–1.3×
Labor: MLOps
Model deployment, monitoring, updates
Licensing: Models
Open weights reduce per-call fees to $0
ROI: Privacy
No data egress, GDPR/HIPAA compliance
Sources & References
All data linked to primary sources — official product pages, academic papers, and benchmark leaderboards
Data current as of Q1 2026 · Hardware specs from official vendor datasheets · Benchmarks from independent leaderboards · All models open-weights
Not financial or procurement advice. Verify specifications with vendors before purchase.