TurboAgent v1.1.0 combines NF4 4-bit weight quantization with streaming KV-cache compression. Qwen2.5-32B runs on a single 24 GB GPU at 65k context with 100% retrieval accuracy at every needle depth tested. RULER-style evaluation, one-line install.
NF4 4-bit weight quantization (via bitsandbytes) shrinks Qwen2.5-32B's weights from ~64 GB BF16 to ~18 GB. Streaming KV keeps the compressed KV cache on CPU and streams only the slice needed for each attention computation. The combination fits a 32B model on a 24 GB consumer GPU — a first for TurboAgent.
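A back-of-envelope check of those weight figures (a sketch only; the parameter count is approximate and the ~10% NF4 overhead factor is an assumption covering quantization constants and layers kept in higher precision):

```python
# Rough weight-memory math behind the ~64 GB -> ~18 GB figures above.
# The 10% NF4 overhead factor is an assumption (quantization constants,
# embeddings and norms kept in higher precision).
PARAMS = 32.5e9                      # Qwen2.5-32B parameter count (approx.)

bf16_gb = PARAMS * 2 / 1e9           # 2 bytes per param  -> ~65 GB
nf4_gb = PARAMS * 0.5 * 1.10 / 1e9   # ~0.5 bytes per param + overhead -> ~18 GB

print(f"BF16 weights: ~{bf16_gb:.0f} GB, NF4 weights: ~{nf4_gb:.0f} GB")
```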
| Context | Accuracy | Decode | Compression | Compressed KV | Peak VRAM | Headroom |
|---|---|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 111 tok/s | 5.28x | 782 MB | 18.80 GiB | 4.7 GiB |
| 32,000 tokens | 5/5 (100%) | 91 tok/s | 5.28x | 1,562 MB | 19.67 GiB | 3.8 GiB |
| 65,000 tokens | 5/5 (100%) | 59 tok/s | 5.28x | 3,170 MB | 21.47 GiB | 2.0 GiB |
| 96,000 tokens | OOM | — | — | — | — | 827 MB free |
15/15 NIAH PASS at 16k / 32k / 65k. Decode throughput drops from 111 → 59 tok/s between 16k and 65k as the per-decode-step PCIe traffic grows (each step transfers one layer's KV slice at a time from CPU pinned memory). NF4 weights consume ~18 GB; the streaming-KV working buffer scales linearly at ~0.87 GiB per 16k tokens. 96k OOMs at the hardware ceiling (1.44 GiB allocation needed, 827 MB free). 65k is the reliable maximum for 32B NF4 on a single 24 GB GPU.
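The linear-scaling claim can be sanity-checked against the table's own numbers (a sketch, not a TurboAgent API):

```python
# Linear VRAM model implied by the table: ~0.87 GiB of streaming working
# buffer per additional 16k tokens on top of the 18.80 GiB measured at 16k.
BASE_VRAM_GIB = 18.80
SLOPE_GIB_PER_16K = 0.87

def predicted_peak_vram(tokens: int) -> float:
    return BASE_VRAM_GIB + SLOPE_GIB_PER_16K * (tokens - 16_000) / 16_000

print(predicted_peak_vram(65_000))  # ~21.5 GiB -> matches the measured 21.47 GiB
print(predicted_peak_vram(96_000))  # ~23.2 GiB -> under 1 GiB of headroom on a
                                    # 24 GB card, so the 1.44 GiB allocation fails
```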
NIAH tests retrieval fidelity; downstream task validation (LongBench, RULER full suite, agent-loop coherence) is ongoing — see the limitations section below.
How to reproduce: install `turboagent-ai[torch,bnb,native]==1.1.0`, then call `TorchEngine(model_id="Qwen/Qwen2.5-32B-Instruct", quantize_weights="nf4", kv_storage="cpu_streaming")`. The hardware detector picks NF4 automatically when it sees a single 24 GB GPU and a 30B+ model.
BF16 streaming-KV results on workstation hardware — the same engine code path that powers the single-4090 NF4 numbers above. 50/50 needle-in-haystack depths PASS across Qwen2.5-7B, Qwen2.5-32B, and Gemma 4 31B at lengths from 4k to 96k tokens.
**Gemma 4 31B (BF16, RTX PRO 6000 × 2)**

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
|---|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 5.80x | 15.5 GB | 2.7 GB | 32.84 GiB |
| 32,000 tokens | 5/5 (100%) | 5.80x | 30.9 GB | 5.3 GB | 36.32 GiB |
| 65,000 tokens | 5/5 (100%) | 5.80x | 62.8 GB | 10.8 GB | 45.31 GiB |
| 96,000 tokens | 5/5 (100%) | 5.80x | 92.7 GB | 16.0 GB | 52.97 GiB |
**Qwen2.5-32B (BF16, RTX PRO 6000 × 2)**

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
|---|---|---|---|---|---|
| 4,000 tokens | 5/5 (100%) | 5.28x | 983 MB | 186 MB | — |
| 16,000 tokens | 5/5 (100%) | 5.28x | 3.9 GB | 730 MB | — |
| 32,000 tokens | 5/5 (100%) | 5.28x | 7.7 GB | 1.5 GB | — |
| 65,000 tokens | 5/5 (100%) | 5.28x | 16.7 GB | 3.2 GB | 33.16 GiB |
| 96,000 tokens | 5/5 (100%) | 5.28x | 24.7 GB | 4.7 GB | 34.85 GiB |
**Qwen2.5-7B (BF16, single RTX 4090)**

| Context | Accuracy | Compression | Compressed KV | Peak VRAM |
|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 5.22x | 173 MB | 19.95 GiB |
Streaming KV applies no compression within the active turn. Each turn's KV is buffered raw (BF16) on CPU during prefill and decode; quantization to TurboQuant happens once at the turn boundary, so cosine similarity inside the active turn is > 0.999. Pinned-memory async prefetch overlaps PCIe transfer with attention compute, as sketched below. Block-allocated buffers (16-token blocks) eliminate per-step heap allocations.
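A minimal sketch of the double-buffered prefetch pattern (illustrative names, not TurboAgent's internals): while the default stream computes attention for layer `i`, a second CUDA stream copies layer `i+1`'s KV slice from pinned CPU memory.

```python
# Double-buffered pinned-memory prefetch: overlap the PCIe copy of layer
# i+1's KV slice with attention compute for layer i. Names (kv_slices_cpu,
# compute_fn) are illustrative, not TurboAgent's API.
import torch

prefetch_stream = torch.cuda.Stream()

def stream_kv_layers(kv_slices_cpu, compute_fn):
    # Pinned CPU memory enables truly asynchronous host-to-device copies.
    kv_slices_cpu = [t.pin_memory() for t in kv_slices_cpu]

    with torch.cuda.stream(prefetch_stream):
        next_gpu = kv_slices_cpu[0].to("cuda", non_blocking=True)

    outputs = []
    for i in range(len(kv_slices_cpu)):
        # Make sure slice i has finished copying before attention uses it.
        torch.cuda.current_stream().wait_stream(prefetch_stream)
        current_gpu = next_gpu
        if i + 1 < len(kv_slices_cpu):
            with torch.cuda.stream(prefetch_stream):
                next_gpu = kv_slices_cpu[i + 1].to("cuda", non_blocking=True)
        # Attention for layer i runs on the default stream while the
        # prefetch stream copies layer i+1's slice over PCIe.
        outputs.append(compute_fn(i, current_gpu))
    return outputs
```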
Auto-detects your GPU and selects NF4 weight quantization, streaming KV, and backend automatically. The hardware detector recommends NF4 when it sees a 30B+ model on a 24 GB GPU.
For Qwen2.5-32B at 65k: ~18 GB weights on GPU + ~3 GB streaming working buffers ≈ 21.5 GiB peak VRAM, with the ~3.2 GB compressed KV held in CPU RAM — it fits a single 4090.
```python
from turboagent.backends.torch import TorchEngine

engine = TorchEngine(
    model_id="Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",       # 4-bit weights
    kv_storage="cpu_streaming",   # compressed KV on CPU
)

response, metrics = engine.generate_chat(
    [{"role": "user", "content": "..."}],
    kv_cache,
)
# Peak VRAM at 65k context: ~21.5 GiB / 24 GiB
```
v1.1.0 ships the streaming KV + NF4 layers. v1.2 adds two more: BitTorch fine-tuning (QLoRA / LoRA / DPO) and FlexTensor automatic weight offloading. Together with the existing TurboQuant KV compression, this is the four-layer memory stack that unlocks Llama-3.1-70B with 200k+ context on a single RTX 4090 at ~3–8 tok/s decode.
QLoRA, LoRA, DPO, GRPO, and prefix tuning ported from BitTorch AI. Train adapters on the same NF4-quantized base model you serve — no separate training pipeline.
Profile-driven CPU↔GPU weight placement (vs. static n_gpu_layers tuning). Automatically keeps hot tensors on GPU; cold tensors stream from system RAM. Estimated 30–50% throughput gain over manual offload.
Tauri + React frontend lifted from BitTorch AI. Model download, quantization, fine-tuning, and inference in one app. Live KV-compression dashboard, multi-turn agent inspector, TurboSwarm visualization.
| Configuration | v1.1 (today) | v1.2 (planned) |
|---|---|---|
| Qwen2.5-32B / single 4090 (24 GB) | 65k context ✓ | 200k+ context |
| Llama-3.1-70B / single 4090 (24 GB) | doesn't fit | 200k+ context, 3–8 tok/s |
| Llama-3.1-70B / dual 4090 (48 GB) | NF4 weights fit; no long-context headroom | full GPU, 15–25 tok/s |
Architectural positioning vs. the closest open-source alternatives for running large models with long context on consumer hardware. We don't have head-to-head benchmark numbers across these stacks; the table reflects what each approach does and doesn't do.
| | llama.cpp | vLLM | Hugging Face Transformers | TurboAgent v1.1 |
|---|---|---|---|---|
| 4-bit weight quant | Q4_K_M (GGUF) | AWQ / GPTQ | NF4 (bnb) | NF4 (bnb) |
| KV cache compression | — | FP8 KV (2x) | — | TurboQuant 5.28–5.80x |
| CPU-side KV streaming | — | — | — | Yes (compressed, async prefetch) |
| Manual GPU/CPU split | n_gpu_layers | tensor parallel only | device_map | device_map="auto" + FlexTensor (v1.2) |
| Multi-turn agent layer | — | — | — | Yes (TurboAgent / TurboSwarm / RAG) |
| 32B at 65k on single 24 GB GPU | partial (no long-ctx) | doesn't fit | doesn't fit | 15/15 NIAH PASS |
llama.cpp is the closest functional comparison — it serves 4-bit weights on consumer GPUs via GGUF quantization. The key gap: llama.cpp doesn't compress the KV cache, so long contexts blow out VRAM regardless of weight quantization. vLLM offers FP8 KV (2x) but no CPU offload of the cache, so it needs multi-GPU for 32B+. TurboAgent's combination — NF4 weights + 5.28x compressed KV streamed from CPU — is the only Python stack we're aware of that fits 32B at 65k on a single 24 GB GPU.
What we measured, how we measured it, and what we explicitly haven't measured yet. Honesty about benchmark scope is more useful than a "zero accuracy loss" claim that doesn't survive contact with downstream tasks.
The needle is a unique code (e.g., XK-9234) embedded in a [CRITICAL INFORMATION] bracketed phrase. Generation uses temperature=0.0 (greedy), SDPA attention, num_logits_to_keep=1 for prefill, and chunked prefill at 512 tokens.
When you call TurboAgent(...) without specifying quantize_weights,
the hardware detector picks based on model size and total VRAM:
| Model size | VRAM tier | Detector recommends |
|---|---|---|
| < 30B params | any | quantize_weights=None (BF16) |
| 30B+ params | 40 GB+ | quantize_weights="nf4" |
| 30B+ params | 24 GB | quantize_weights="nf4" + kv_storage="cpu_streaming" |
| 30B+ params | 16 GB / 8–12 GB | quantize_weights="nf4" + smaller chunked prefill |
All recommendations can be overridden by passing the keyword argument explicitly. If
bitsandbytes isn't installed, NF4 falls back to BF16 with a clear log message
instead of failing silently.
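For example, a hypothetical explicit override (any detector recommendation can be replaced by passing the keyword argument yourself):

```python
from turboagent.backends.torch import TorchEngine

# Force BF16 weights even if the detector would recommend NF4.
engine = TorchEngine(
    model_id="Qwen/Qwen2.5-32B-Instruct",
    quantize_weights=None,        # explicit override of the detector
    kv_storage="cpu_streaming",
)
```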
NIAH is the standard for retrieval but it's narrow — it tests "can the model find the needle?", not "does compression preserve multi-hop reasoning, JSON-schema adherence, or agent-loop coherence?" Planned for the v1.2 evaluation: LongBench, the full RULER suite, and multi-turn agent-loop coherence tests.
Everything you need for long-context agentic AI on consumer hardware.
4-bit weights via bitsandbytes. Drops Qwen2.5-32B from 64 GB BF16 to 18 GB. quantize_weights="nf4", double-quant on, BF16 compute dtype.
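For reference, roughly what quantize_weights="nf4" corresponds to in raw Hugging Face Transformers + bitsandbytes (a sketch of the equivalent configuration, not TurboAgent code):

```python
# Equivalent NF4 setup in plain Transformers: 4-bit NF4 weights,
# double quantization, BF16 compute dtype.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```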
Compressed KV stays on CPU; only the active layer's slice is on GPU. Pinned memory + async prefetch overlap PCIe with compute. 16-token block buffers, zero per-step allocs.
5.22–5.80x KV cache compression with zero intra-turn quantization noise. Native bit-packing kernels (turboquant-kv); 1.5–2x fallback in pure PyTorch.
Specialist agents share a single compressed KV pool. Round-robin or custom routing. ~5x less memory than independent contexts.
Inner-product-preserving search using TurboQuant codebooks. Better recall than Product Quantization at the same bit budget.
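The generic idea behind codebook-based inner-product search, sketched in NumPy (illustrative only — not the turboquant-kv API): precompute query-codeword inner products once per query, then score compressed vectors with table lookups instead of full dot products.

```python
# Lookup-table scoring over codeword indices: the inner product of the query
# with a compressed vector decomposes into a sum of per-subspace table entries.
import numpy as np

def build_lut(query, codebooks):
    # codebooks: (n_subspaces, n_codes, sub_dim); split the query to match.
    subqueries = query.reshape(len(codebooks), -1)
    return np.einsum("sd,skd->sk", subqueries, codebooks)  # (n_subspaces, n_codes)

def scores(codes, lut):
    # codes: (n_vectors, n_subspaces) integer codeword indices per vector.
    return lut[np.arange(lut.shape[0]), codes].sum(axis=1)

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 256, 16))     # 8 subspaces, 256 codes each
codes = rng.integers(0, 256, size=(1000, 8))  # 1000 compressed vectors
query = rng.normal(size=128)

lut = build_lut(query, codebooks)
top5 = np.argsort(-scores(codes, lut))[:5]    # top-5 by approximate inner product
```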
FastAPI /v1/chat/completions, drop-in for any OpenAI SDK. Docker with GPU passthrough, auth, rate limiting, per-session compressed KV.
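Since the endpoint is OpenAI-compatible, the official SDK can point at it directly (host, port, and API key below are assumptions — use whatever you launched the server with):

```python
# Talking to a TurboAgent server with the stock OpenAI SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Summarize this 60k-token transcript."}],
)
print(resp.choices[0].message.content)
```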
Validated against the TurboQuant paper's theoretical bounds on real hardware.
| Mode | Bits/Value | Compression | 32B KV @ 65k | Best For |
|---|---|---|---|---|
| turbo3 | 3.25 bpv | 5.22–5.80x | ~3.2 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~4.4 GB | Higher fidelity, ample memory |
| FP16 baseline | 16 bpv | 1x | ~16.7 GB | Reference (no compression) |
Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.
```python
from turboagent.agents.swarm import TurboSwarm, SwarmAgent

swarm = TurboSwarm(
    "Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")
```
Cross-model, cross-hardware validation with production models.
| Model | Hardware | VRAM | Max Context | NIAH Accuracy |
|---|---|---|---|---|
| Qwen2.5-32B (NF4) | Single RTX 4090 | 24 GB | 65,000 tokens | 15/15 (100%) |
| Qwen2.5-32B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 25/25 (100%) |
| Gemma 4 31B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 20/20 (100%) |
| Qwen2.5-7B (BF16) | Single RTX 4090 | 24 GB | 16,000 tokens | 5/5 (100%) |
Total: 65/65 NIAH depths PASS across 4 model/hardware configurations. Combined Streaming KV + NF4 path enables the "32B on 24 GB" headline result.
SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.
The open-source core is free forever under the MIT license.