Enterprise-grade · Air-gapped · 2-year hardware horizon
Self-Hosted LLMs
for Coding Agents
A complete hardware, model, and infrastructure reference for running state-of-the-art coding agents entirely on-premises.
10
Evaluated Models
Leading coding LLMs
97.5%
Peak HumanEval
Qwen3-235B-A22B
192 GB
Max NVIDIA GPU VRAM
NVIDIA B200 SXM
5
Inference Engines
Production-ready
6
Open-license Models
Apache 2.0 / MIT
2026 Recommendation
Pair NVIDIA H200 or B200 GPUs with vLLM or SGLang, serving Qwen2.5-Coder-32B or DeepSeek-R1 (671B MoE), for maximum agentic coding performance on-premises.
DeepSeek-R1 · vLLM · H200 SXM · NVLink 4.0 · FP8
HumanEval+ Score (%) — Higher is better
Qwen3-235B-A22B · Apache 2.0
Parameters
235B (MoE)
Active Params
22B
Context Window
128K
Min VRAM
48 GB (FP8 MoE)
HumanEval+
97.5%
MBPP+
95.1%
SWE-bench
52.8%
Thinking mode · Tool-calling · Best Apache 2.0 coding model · Multilingual
Capability Profile — Qwen3-235B-A22B
Supported quantization formats
FP16 · BF16 · AWQ · GGUF · FP8
GPU Hardware
Specifications from official product pages · NVIDIA Data Center · AMD Instinct
Memory Bandwidth (GB/s) — Higher is better
NVIDIA
AMD
NVIDIA H200 SXM
Hopper
VRAM
141 GB
Bandwidth
5 TB/s
BF16
989 TFLOPS
TDP
700 W
Large model inference, 70B–405B models
NVLink · ↗ Official Specs
NVIDIA B200 SXM
Blackwell
VRAM
192 GB
Bandwidth
8 TB/s
BF16
2 PFLOPS
TDP
1 kW
Frontier model inference, 405B–671B models
NVLink · ↗ Official Specs
NVIDIA GB200 NVL72
Blackwell
VRAM
13.5 TB
Bandwidth
576 TB/s
FP4
1,440 PFLOPS
TDP
120 kW
Full cluster inference, trillion-param models
NVLink · ↗ Official Specs
AMD MI325X
CDNA 3
VRAM
256 GB
Bandwidth
6 TB/s
BF16
1 PFLOPS
TDP
750 W
Large model inference, 70B–405B alternative
AMD MI350X
CDNA 4
VRAM
288 GB
Bandwidth
8 TB/s
BF16
2 PFLOPS
TDP
800 W
Next-gen inference, competes with B200
NVIDIA H100 SXM5
Hopper
VRAM
80 GB
Bandwidth
3 TB/s
BF16
989 TFLOPS
TDP
700 W
7B–70B models, cost-effective workhorse
NVLink · ↗ Official Specs
Rack-Scale System
NVIDIA GB200 NVL72
72 GPUs
B200 per rack
13.5 TB
Total HBM3e
1.44 ExaFLOPS
FP4 inference (rack)
576 TB/s
Aggregate HBM3e bandwidth
120 kW
Rack TDP
Designed to run 671B+ MoE frontier models at full FP8 precision. The entire rack is connected over NVLink 5.0, so tensor-parallel traffic stays on the NVLink fabric instead of crossing slower networking.
↗ NVIDIA GB200 NVL72 — Official Product Page
Inference Frameworks
Production serving engines — ranked by peak throughput capability
vLLM
vLLM Project (UC Berkeley) · Python · Apache 2.0
Production-grade inference engine with continuous batching and paged KV-cache management.
Best for: Enterprise REST API serving, high-concurrency workloads
⚡ PagedAttention · Multi-GPU · Speculative Decoding · OpenAI-compatible
Relative Throughput
High
Quantization Support
FP8 · AWQ · GPTQ · INT8
TensorRT-LLM
NVIDIA · C++/Python · Apache 2.0
Maximum throughput on NVIDIA hardware via custom CUDA kernels and TensorRT graph optimization.
Best for: NVIDIA-only infra, maximum tokens/sec
⚡ NVIDIA-optimized kernels · Multi-GPU · Speculative Decoding · Triton Inference Server
Relative Throughput
Highest (NVIDIA)
Quantization Support
FP8 · INT4 · INT8 · AWQ · GPTQ
SGLang
LMSYS / Stanford · Python · Apache 2.0
Structured generation language with RadixAttention for shared prefix caching and fast JSON mode.
Best for: Agentic workloads, tool-calling, structured outputs
⚡ RadixAttention · Multi-GPU · Speculative Decoding · OpenAI-compatible
Relative Throughput
Very High
Quantization Support
FP8 · AWQ · GPTQ · INT8
llama.cpp
Georgi Gerganov (Community) · C++ · MIT
Highly portable inference for CPU and GPU. Runs on virtually any hardware with GGUF quantization.
Best for: Air-gapped / edge deployments, low-VRAM servers
⚡ CPU+GPU hybrid · Multi-GPU · Speculative Decoding · OpenAI-compatible (server mode)
Relative Throughput
Moderate
Quantization Support
GGUF (Q2–Q8) · IQ (imatrix)
Ollama
Ollama Inc. · Go/C++ · MIT
Developer-friendly wrapper around llama.cpp for instant local model deployment.
Best for: Developer workstations, internal tooling prototypes
⚡ One-command deploy · OpenAI-compatible
Relative Throughput
Moderate
Quantization Support
GGUF all variants
Quick Comparison
| Framework | Multi-GPU | Spec. Decode | API Compat | Best For |
|---|---|---|---|---|
| vLLM | ✓ | ✓ | OpenAI-compatible | Enterprise REST API serving |
| TensorRT-LLM | ✓ | ✓ | Triton Inference Server | NVIDIA-only infra |
| SGLang | ✓ | ✓ | OpenAI-compatible | Agentic workloads |
| llama.cpp | ✓ | ✓ | OpenAI-compatible (server mode) | Air-gapped / edge deployments |
| Ollama | — | — | OpenAI-compatible | Developer workstations |
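Every engine in the table except TensorRT-LLM's native Triton path speaks the same OpenAI-compatible HTTP API, so one client works against vLLM, SGLang, llama.cpp server mode, or Ollama unchanged. A minimal standard-library sketch; the base URL and model name are deployment-specific placeholders, not values mandated by any engine:

```python
import json
from urllib import request


def build_payload(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep code generation near-deterministic
        "max_tokens": max_tokens,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Point `base_url` at your serving endpoint, e.g. `http://localhost:8000` for a default `vllm serve` instance; the model name must match whatever identifier the server registered at startup.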
Infrastructure & Deployment
GPU requirements per model size · Based on vLLM Distributed Serving Docs
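The FP16/FP8 VRAM figures for the dense tiers below follow a weights-only rule of thumb: parameter count times bytes per parameter (the MoE entries are quoted differently and do not follow this rule). A sketch for first-pass capacity planning only; real deployments need additional headroom for KV cache, activations, and framework overhead:

```python
def weights_vram_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only VRAM in GB: parameters (billions) x bytes per parameter.

    FP16/BF16 = 2 bytes, FP8/INT8 = 1, INT4 = 0.5. KV cache, activations,
    and serving-engine overhead come on top (plan roughly +20% or more).
    """
    return params_b * bytes_per_param


# Dense-model figures matching the tiers below:
assert weights_vram_gb(7, 2) == 14     # Qwen2.5-Coder-7B, FP16
assert weights_vram_gb(32, 2) == 64    # Qwen2.5-Coder-32B, FP16
assert weights_vram_gb(70, 2) == 140   # Llama 3.3 70B, FP16
assert weights_vram_gb(70, 1) == 70    # Llama 3.3 70B, FP8
```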
7B
params
Qwen2.5-Coder-7B
Developer IDE assistants, lightweight agentic tools
1× H100 80GB · Alt: 1× MI300X
FP16 VRAM
14 GB
FP8 VRAM
7 GB
Throughput
~1,500 tok/s
Latency
<50ms TTFT
Framework
vLLM / Ollama
32B
params
Qwen2.5-Coder-32B
Team-level coding agents, CI/CD review bots
1× H200 SXM · Alt: 2× H100 80GB
FP16 VRAM
64 GB
FP8 VRAM
32 GB
Throughput
~700 tok/s
Latency
<120ms TTFT
Framework
vLLM / SGLang
70B
params
Llama 3.3 70B
Org-wide coding assistant, complex refactoring agents
1× H200 SXM · Alt: 2× H100 80GB / 1× MI325X
FP16 VRAM
140 GB
FP8 VRAM
70 GB
Throughput
~400 tok/s
Latency
<200ms TTFT
Framework
vLLM / TensorRT-LLM
236B MoE
params
DeepSeek-Coder-V2
Enterprise-grade multi-agent coding systems
4× H200 SXM · Alt: 2× B200 SXM / 2× MI325X
FP16 VRAM
320 GB
FP8 VRAM
160 GB
Throughput
~200 tok/s
Latency
<400ms TTFT
Framework
vLLM / SGLang
671B MoE
params
DeepSeek-R1
Frontier reasoning & autonomous coding agents
8× H200 SXM · Alt: 4× B200 / GB200 NVL rack
FP16 VRAM
720 GB
FP8 VRAM
360 GB
Throughput
~80 tok/s
Latency
<800ms TTFT
Framework
vLLM / SGLang
Recommended Enterprise Stack
Orchestration
Kubernetes + KServe / Ray Serve
Serving API
vLLM / SGLang (OpenAI-compatible)
Optimization
TensorRT-LLM (NVIDIA) / ROCm (AMD)
GPU Interconnect
NVLink 4/5 + InfiniBand NDR 400 Gb/s
Storage
NVMe-oF / Local NVMe RAID (>7 GB/s)
Observability
Prometheus + Grafana + DCGM
Enterprise Considerations
Security, compliance, and operational requirements for production LLM deployment
🔒 Data Privacy
Critical
- All inference stays on-premises — no data leaves your network
- Air-gapped deployment supported with llama.cpp or vLLM
- Encrypted model weights at rest (LUKS / BitLocker)
- TLS termination for all API endpoints
✅ Compliance
High
- SOC 2 Type II: audit logging, access control, change management
- HIPAA: PHI never sent to external LLM APIs
- GDPR: no personal data processed by third-party model providers
- FedRAMP: on-prem deployment meets data residency requirements
⚡ High Availability
High
- Multi-node inference with load balancing (Kubernetes + KServe)
- N+1 GPU redundancy for critical workloads
- Rolling model updates with zero-downtime hot-swap
- Health checks and auto-restart via systemd or Kubernetes liveness probes
🌐 Networking
High
- InfiniBand HDR/NDR (400 Gb/s) for multi-GPU tensor parallelism
- NVLink 4.0 for within-node GPU interconnect (900 GB/s)
- 25 GbE minimum for client-to-inference-server traffic
- RDMA over Converged Ethernet (RoCE) as an InfiniBand alternative
💾 Storage
Medium
- NVMe SSDs (>7 GB/s sequential) for fast model loading (<60 s for 70B)
- Minimum 4 TB raw storage per inference node for multiple model versions
- Shared NFS / NVMe-oF for a centralized model repository
- Snapshots and versioning for model weight management
📊 Monitoring & Observability
Medium
- Prometheus + Grafana for GPU utilization, throughput, and latency metrics
- DCGM (Data Center GPU Manager) for NVIDIA GPU health
- OpenTelemetry tracing for per-request latency breakdown
- Alerting on GPU memory pressure, thermal throttling, and error rates
Total Cost of Ownership — Key Factors
CapEx: Hardware
GPU servers, networking, storage
OpEx: Power
H100 ~700W, B200 ~1kW per GPU
OpEx: Cooling
Data center PUE 1.1–1.3×
Labor: MLOps
Model deployment, monitoring, updates
Licensing: Models
Open weights reduce per-call fees to $0
ROI: Privacy
No data egress, GDPR/HIPAA compliance
Sources & References
All data linked to primary sources — official product pages, academic papers, and benchmark leaderboards
Data current as of Q1 2026 · Hardware specs from official vendor datasheets · Benchmarks from independent leaderboards · All models open-weights
Not financial or procurement advice. Verify specifications with vendors before purchase.