v1.1.0 — NOW ON PyPI

32B models on a single RTX 4090. 65k context.

TurboAgent v1.1.0 combines NF4 4-bit weight quantization with streaming KV-cache compression. Qwen2.5-32B runs on a single 24 GB GPU at 65k context, 100% retrieval accuracy at every needle depth tested. RULER-style evaluation, one-line install.

pip install turboagent-ai[torch,bnb,native]==1.1.0
32B on a single 24 GB GPU
65k tokens on a single RTX 4090
50/50 NIAH depths PASS (3 models)
5.80x KV compression (Gemma 4 31B)

v1.1.0 Headline: NF4 + Streaming KV on a Single RTX 4090

NF4 4-bit weight quantization (via bitsandbytes) shrinks Qwen2.5-32B's weights from ~64 GB BF16 to ~18 GB. Streaming KV keeps the compressed KV cache on CPU and streams only the slice needed for each attention computation. The combination fits a 32B model on a 24 GB consumer GPU — the first time TurboAgent has done this.
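
The weight arithmetic is simple; rough numbers below, where the gap between the raw 4-bit figure and ~18 GB is quantization constants and layers kept in higher precision:

# Approximate weight-size arithmetic for a ~32B-parameter model.
params = 32e9                  # Qwen2.5-32B, rough parameter count
bf16_gb = params * 2 / 1e9     # 2 bytes per parameter -> ~64 GB
nf4_gb = params * 0.5 / 1e9    # 4 bits per parameter  -> ~16 GB raw
print(f"BF16 ~{bf16_gb:.0f} GB, NF4 ~{nf4_gb:.0f} GB (+ quant constants -> ~18 GB)")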

Qwen2.5-32B-Instruct (NF4 + Streaming KV) on Single RTX 4090 (24 GB)

| Context | Accuracy | Decode | Compression | Compressed KV | Peak VRAM | Headroom |
| 16,000 tokens | 5/5 (100%) | 111 tok/s | 5.28x | 782 MB | 18.80 GiB | 4.7 GiB |
| 32,000 tokens | 5/5 (100%) | 91 tok/s | 5.28x | 1,562 MB | 19.67 GiB | 3.8 GiB |
| 65,000 tokens | 5/5 (100%) | 59 tok/s | 5.28x | 3,170 MB | 21.47 GiB | 2.0 GiB |
| 96,000 tokens | OOM | - | - | - | - | 827 MB free |

15/15 NIAH PASS at 16k / 32k / 65k. Decode throughput drops from 111 → 59 tok/s between 16k and 65k as the per-decode-step PCIe traffic grows (each step transfers one layer's KV slice at a time from CPU pinned memory). NF4 weights consume ~18 GB; the streaming-KV working buffer scales linearly at ~0.87 GiB per 16k tokens. 96k OOMs at the hardware ceiling (1.44 GiB allocation needed, 827 MB free). 65k is the reliable maximum for 32B NF4 on a single 24 GB GPU.
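
Those peak-VRAM figures follow a simple linear model. A quick estimator fit to the three measured points above (a back-of-envelope sketch, not the engine's memory planner):

# Rough linear fit to the measured 16k / 32k / 65k points above.
WEIGHTS_GIB = 18.0            # NF4-quantized 32B weights
BUFFER_GIB_PER_16K = 0.87     # streaming-KV working buffer per 16k tokens

def estimate_peak_vram_gib(context_tokens: int) -> float:
    return WEIGHTS_GIB + BUFFER_GIB_PER_16K * context_tokens / 16_000

for ctx in (16_000, 32_000, 65_000, 96_000):
    print(f"{ctx:>6} tokens -> ~{estimate_peak_vram_gib(ctx):.1f} GiB")
# ~18.9, ~19.7, ~21.5, ~23.2 GiB; 96k lands past what a 24 GB card can actually allocate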

NIAH tests retrieval fidelity; downstream task validation (LongBench, RULER full suite, agent-loop coherence) is ongoing — see the limitations section below.

How to reproduce: install turboagent-ai[torch,bnb,native]==1.1.0, then call TorchEngine(model_id="Qwen/Qwen2.5-32B-Instruct", quantize_weights="nf4", kv_storage="cpu_streaming"). The hardware detector picks NF4 automatically when it sees a single 24 GB GPU and a 30B+ model.

Streaming KV — Validated Across 3 Models, 5 Context Lengths

BF16 streaming-KV results on workstation hardware — the same engine code path that powers the single-4090 NF4 numbers above. 50/50 needle-in-haystack depths PASS across Qwen2.5-7B, Qwen2.5-32B, and Gemma 4 31B at lengths from 4k to 96k tokens.

Gemma 4 31B (BF16, 2× RTX PRO 6000 Blackwell)

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
| 16,000 tokens | 5/5 (100%) | 5.80x | 15.5 GB | 2.7 GB | 32.84 GiB |
| 32,000 tokens | 5/5 (100%) | 5.80x | 30.9 GB | 5.3 GB | 36.32 GiB |
| 65,000 tokens | 5/5 (100%) | 5.80x | 62.8 GB | 10.8 GB | 45.31 GiB |
| 96,000 tokens | 5/5 (100%) | 5.80x | 92.7 GB | 16.0 GB | 52.97 GiB |

Qwen2.5-32B (BF16, 2× RTX PRO 6000 Blackwell)

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
| 4,000 tokens | 5/5 (100%) | 5.28x | 983 MB | 186 MB | - |
| 16,000 tokens | 5/5 (100%) | 5.28x | 3.9 GB | 730 MB | - |
| 32,000 tokens | 5/5 (100%) | 5.28x | 7.7 GB | 1.5 GB | - |
| 65,000 tokens | 5/5 (100%) | 5.28x | 16.7 GB | 3.2 GB | 33.16 GiB |
| 96,000 tokens | 5/5 (100%) | 5.28x | 24.7 GB | 4.7 GB | 34.85 GiB |

Qwen2.5-7B (BF16, Single RTX 4090)

| Context | Accuracy | Compression | Compressed KV | Peak VRAM |
| 16,000 tokens | 5/5 (100%) | 5.22x | 173 MB | 19.95 GiB |

Streaming KV applies no compression intra-turn. Each turn's KV is buffered raw (BF16) on CPU during prefill and decode; quantization to TurboQuant happens once at the turn boundary, so cosine similarity inside the active turn is > 0.999. Pinned-memory async prefetch overlaps PCIe transfer with attention compute. Block-allocated buffers (16-token blocks) eliminate per-step heap allocations.
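
The prefetch pattern itself is standard PyTorch. A minimal illustrative sketch of the idea (invented shapes and names, not TurboAgent's internal code): pinned host buffers, a dedicated copy stream, and per-copy events so the next layer's transfer overlaps the current layer's attention.

import torch

# Illustrative sketch only: overlap CPU->GPU KV copies with per-layer attention
# using pinned host buffers, a dedicated copy stream, and per-copy events.
assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()

def start_prefetch(kv_cpu_pinned):
    """Launch an async H2D copy on the copy stream; return (gpu_tensor, event)."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        kv_gpu = kv_cpu_pinned.to("cuda", non_blocking=True)  # pinned => truly async
        done.record(copy_stream)
    return kv_gpu, done

# One pinned BF16 host buffer per layer, allocated in 16-token blocks.
num_layers, block_tokens, kv_heads, head_dim = 4, 16, 8, 128
kv_host = [torch.empty(block_tokens, kv_heads, head_dim,
                       dtype=torch.bfloat16, pin_memory=True)
           for _ in range(num_layers)]

pending = start_prefetch(kv_host[0])
for layer in range(num_layers):
    kv_gpu, ready = pending
    if layer + 1 < num_layers:
        pending = start_prefetch(kv_host[layer + 1])      # next copy overlaps compute
    torch.cuda.current_stream().wait_event(ready)          # slice resident before use
    # ... this layer's attention would consume kv_gpu here ...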

One-line setup, hardware-aware

TurboAgent detects your GPU and selects NF4 weight quantization, streaming KV, and the backend automatically. The hardware detector recommends NF4 when it sees a 30B+ model on a 24 GB GPU.

For Qwen2.5-32B at 65k: ~18 GB of NF4 weights plus ~3 GB of working buffers on the GPU, with the 3.2 GB compressed KV held in CPU RAM, keeps peak VRAM around 21.5 GiB on a single 4090.

from turboagent.backends.torch import TorchEngine

engine = TorchEngine(
    model_id="Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",       # 4-bit weights
    kv_storage="cpu_streaming",    # compressed KV on CPU
)

# kv_cache: the session's compressed KV cache object, carried across turns
response, metrics = engine.generate_chat(
    [{"role": "user", "content": "..."}],
    kv_cache,
)
# Peak VRAM at 65k context: ~21.5 GiB / 24 GiB

What's Next: v1.2 — Llama-3.1-70B on a 4090

v1.1.0 ships the streaming KV + NF4 layers. v1.2 adds two more: BitTorch fine-tuning (QLoRA / LoRA / DPO) and FlexTensor automatic weight offloading. Together with the existing TurboQuant KV compression, this is the four-layer memory stack that unlocks Llama-3.1-70B with 200k+ context on a single RTX 4090 at ~3–8 tok/s decode.

BitTorch Fine-Tuning

QLoRA, LoRA, DPO, GRPO, and prefix tuning ported from BitTorch AI. Train adapters on the same NF4-quantized base model you serve — no separate training pipeline.

FlexTensor Weight Offload

Profile-driven CPU↔GPU weight placement (vs. static n_gpu_layers tuning). Automatically keeps hot tensors on GPU; cold tensors stream from system RAM. Estimated 30–50% throughput gain over manual offload.
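
FlexTensor lands in v1.2, but the core idea is a budgeted placement problem. A toy sketch of the greedy heuristic (invented profile format and names, not the planned implementation):

# Toy greedy placement: keep the most frequently accessed tensors on GPU until
# the VRAM budget is spent; everything else stays in system RAM and streams on demand.
def place_tensors(profile, vram_budget_gb):
    """profile: list of (name, size_gb, accesses_per_token) from a profiling pass."""
    ranked = sorted(profile, key=lambda t: t[2], reverse=True)   # hottest first
    placement, used_gb = {}, 0.0
    for name, size_gb, _ in ranked:
        if used_gb + size_gb <= vram_budget_gb:
            placement[name] = "cuda"
            used_gb += size_gb
        else:
            placement[name] = "cpu"       # streamed from RAM when needed
    return placement

profile = [("hot_block", 1.2, 1.0), ("warm_block", 0.9, 0.5), ("cold_block", 1.2, 0.05)]
print(place_tensors(profile, vram_budget_gb=2.5))
# {'hot_block': 'cuda', 'warm_block': 'cuda', 'cold_block': 'cpu'}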

Desktop GUI

Tauri + React frontend lifted from BitTorch AI. Model download, quantization, fine-tuning, and inference in one app. Live KV-compression dashboard, multi-turn agent inspector, TurboSwarm visualization.

| Configuration | v1.1 (today) | v1.2 (planned) |
| Qwen2.5-32B / single 4090 (24 GB) | 65k context ✓ | 200k+ context |
| Llama-3.1-70B / single 4090 (24 GB) | doesn't fit | 200k+ context, 3–8 tok/s |
| Llama-3.1-70B / dual 4090 (48 GB) | weights only fit BF16 | full GPU, 15–25 tok/s |

How TurboAgent Compares

Architectural positioning vs. the closest open-source alternatives for running large models with long context on consumer hardware. We don't have head-to-head benchmark numbers across these stacks; the table reflects what each approach does and doesn't do.

| Capability | llama.cpp | vLLM | Hugging Face Transformers | TurboAgent v1.1 |
| 4-bit weight quant | Q4_K_M (GGUF) | AWQ / GPTQ | NF4 (bnb) | NF4 (bnb) |
| KV cache compression | - | FP8 KV (2x) | - | TurboQuant 5.28–5.80x |
| CPU-side KV streaming | - | - | - | Yes (compressed, async prefetch) |
| Manual GPU/CPU split | n_gpu_layers | tensor parallel only | device_map | device_map="auto" + FlexTensor (v1.2) |
| Multi-turn agent layer | - | - | - | Yes (TurboAgent / TurboSwarm / RAG) |
| 32B at 65k on single 24 GB GPU | partial (no long-ctx) | doesn't fit | doesn't fit | 15/15 NIAH PASS |

llama.cpp is the closest functional comparison — it serves 4-bit weights on consumer GPUs via GGUF quantization. The key gap: llama.cpp doesn't compress the KV cache, so long contexts blow out VRAM regardless of weight quantization. vLLM offers FP8 KV (2x) but no CPU offload of the cache, so it needs multi-GPU for 32B+. TurboAgent's combination — NF4 weights + 5.28x compressed KV streamed from CPU — is the only Python stack we're aware of that fits 32B at 65k on a single 24 GB GPU.

Methodology, Reproducibility & Limitations

What we measured, how we measured it, and what we explicitly haven't measured yet. Honesty about benchmark scope is more useful than a "zero accuracy loss" claim that doesn't survive contact with downstream tasks.

Evaluation methodology

Hardware detector NF4 fallback

When you call TurboAgent(...) without specifying quantize_weights, the hardware detector picks based on model size and total VRAM:

| Model size | VRAM tier | Detector recommends |
| < 60B params | any | quantize_weights=None (BF16) |
| 60B+ params | 40 GB+ | quantize_weights="nf4" |
| 60B+ params | 24 GB | quantize_weights="nf4" + kv_storage="cpu_streaming" |
| 60B+ params | 16 GB / 8–12 GB | quantize_weights="nf4" + smaller chunked prefill |

All recommendations can be overridden by passing the keyword argument explicitly. If bitsandbytes isn't installed, NF4 falls back to BF16 with a clear log message instead of failing silently.

What we haven't measured (yet)

NIAH is the standard for retrieval but it's narrow — it tests "can the model find the needle?", not "does compression preserve multi-hop reasoning, JSON-schema adherence, or agent-loop coherence?" Those broader downstream evaluations are planned for v1.2.

Features

Everything you need for long-context agentic AI on consumer hardware.

NF4 Weight Quantization

4-bit weights via bitsandbytes. Drops Qwen2.5-32B from 64 GB BF16 to 18 GB. quantize_weights="nf4", double-quant on, BF16 compute dtype.
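
The settings above map to the standard bitsandbytes configuration in Hugging Face transformers. Shown for reference (standard HF API, not TurboAgent's internal code):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same settings the blurb describes: NF4, double quantization, BF16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)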

Streaming KV Cache

Compressed KV stays on CPU; only the active layer's slice is on GPU. Pinned memory + async prefetch overlap PCIe with compute. 16-token block buffers, zero per-step allocs.

TurboQuant Compression

5.22–5.80x KV cache compression with zero intra-turn quantization noise. Native bit-packing kernels (turboquant-kv); 1.5–2x fallback in pure PyTorch.

Multi-Agent Swarms

Specialist agents share a single compressed KV pool. Round-robin or custom routing. ~5x less memory than independent contexts.

RAG Vector Store

Inner-product-preserving search using TurboQuant codebooks. Better recall than Product Quantization at the same bit budget.

OpenAI-Compatible Server

FastAPI /v1/chat/completions, drop-in for any OpenAI SDK. Docker with GPU passthrough, auth, rate limiting, per-session compressed KV.
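
Because the endpoint is OpenAI-compatible, the official OpenAI Python SDK works against it unchanged. The base URL, port, and model name below are placeholders for however the server is deployed:

from openai import OpenAI

# Point the standard OpenAI client at the local TurboAgent server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",   # model name as configured on the server
    messages=[{"role": "user", "content": "Summarize the last 60k tokens of this log."}],
)
print(resp.choices[0].message.content)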

Compression Modes

Validated against the paper's theoretical bounds on real hardware.

| Mode | Bits/Value | Compression | 32B KV @ 65k | Best For |
| turbo3 | 3.25 bpv | 5.22–5.80x | ~3.2 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~4.4 GB | Higher fidelity, ample memory |
| FP16 baseline | 16 bpv | 1x | ~16.7 GB | Reference (no compression) |

Multi-Agent Swarms

Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.

from turboagent.agents.swarm import (
    TurboSwarm, SwarmAgent
)

swarm = TurboSwarm(
    "Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")

Validated at Scale

Cross-model, cross-hardware validation with production models.

| Model | Hardware | VRAM | Max Context | NIAH Accuracy |
| Qwen2.5-32B (NF4) | Single RTX 4090 | 24 GB | 65,000 tokens | 15/15 (100%) |
| Qwen2.5-32B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 25/25 (100%) |
| Gemma 4 31B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 20/20 (100%) |
| Qwen2.5-7B (BF16) | Single RTX 4090 | 24 GB | 16,000 tokens | 5/5 (100%) |

Total: 65/65 NIAH depths PASS across 4 model/hardware configurations. The combined streaming-KV + NF4 path enables the "32B on 24 GB" headline result.

TurboAgent Enterprise

SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.

The open-source core is free forever under the MIT license.
