TurboAgent v1.1.0 combines NF4 4-bit weight quantization with streaming KV-cache compression. Qwen2.5-32B runs on a single 24 GB GPU at 65k context with 100% retrieval accuracy at every needle depth tested. RULER-style evaluation, one-line install.
NF4 4-bit weight quantization (via bitsandbytes) shrinks Qwen2.5-32B's weights from ~64 GB BF16 to ~18 GB. Streaming KV keeps the compressed KV cache on CPU and streams only the slice needed for each attention computation. The combination fits a 32B model on a 24 GB consumer GPU — a first for TurboAgent.
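A back-of-envelope check of those weight figures (a sketch only; the parameter count is approximate and the ~10% NF4 overhead factor is an assumption covering quantization constants and layers kept in higher precision):

```python
# Rough weight-memory math behind the ~64 GB -> ~18 GB figures above.
# The 10% NF4 overhead factor is an assumption (quantization constants,
# embeddings and norms kept in higher precision).
PARAMS = 32.5e9                      # Qwen2.5-32B parameter count (approx.)

bf16_gb = PARAMS * 2 / 1e9           # 2 bytes per param  -> ~65 GB
nf4_gb = PARAMS * 0.5 * 1.10 / 1e9   # ~0.5 bytes per param + overhead -> ~18 GB

print(f"BF16 weights: ~{bf16_gb:.0f} GB, NF4 weights: ~{nf4_gb:.0f} GB")
```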
| Context | Accuracy | Decode | Compression | Compressed KV | Peak VRAM | Headroom |
|---|---|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 111 tok/s | 5.28x | 782 MB | 18.80 GiB | 4.7 GiB |
| 32,000 tokens | 5/5 (100%) | 91 tok/s | 5.28x | 1,562 MB | 19.67 GiB | 3.8 GiB |
| 65,000 tokens | 5/5 (100%) | 59 tok/s | 5.28x | 3,170 MB | 21.47 GiB | 2.0 GiB |
| 96,000 tokens | OOM | — | — | — | — | 827 MB free |
15/15 NIAH PASS at 16k / 32k / 65k. Decode throughput drops from 111 → 59 tok/s between 16k and 65k as the per-decode-step PCIe traffic grows (each step transfers one layer's KV slice at a time from CPU pinned memory). NF4 weights consume ~18 GB; the streaming-KV working buffer scales linearly at ~0.87 GiB per 16k tokens. 96k OOMs at the hardware ceiling (1.44 GiB allocation needed, 827 MB free). 65k is the reliable maximum for 32B NF4 on a single 24 GB GPU.
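The linear-scaling claim can be sanity-checked against the table's own numbers (a sketch, not a TurboAgent API):

```python
# Linear VRAM model implied by the table: ~0.87 GiB of streaming working
# buffer per additional 16k tokens on top of the 18.80 GiB measured at 16k.
BASE_VRAM_GIB = 18.80
SLOPE_GIB_PER_16K = 0.87

def predicted_peak_vram(tokens: int) -> float:
    return BASE_VRAM_GIB + SLOPE_GIB_PER_16K * (tokens - 16_000) / 16_000

print(predicted_peak_vram(65_000))  # ~21.5 GiB -> matches the measured 21.47 GiB
print(predicted_peak_vram(96_000))  # ~23.2 GiB -> under 1 GiB of headroom on a
                                    # 24 GB card, so the 1.44 GiB allocation fails
```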
NIAH tests retrieval fidelity; downstream task validation (LongBench, RULER full suite, agent-loop coherence) is ongoing — see the limitations section below.
How to reproduce: install `turboagent-ai[torch,bnb,native]==1.1.0`, then call `TorchEngine(model_id="Qwen/Qwen2.5-32B-Instruct", quantize_weights="nf4", kv_storage="cpu_streaming")`. The hardware detector picks NF4 automatically when it sees a single 24 GB GPU and a 30B+ model.
BF16 streaming-KV results on workstation hardware — the same engine code path that powers the single-4090 NF4 numbers above. 50/50 needle-in-haystack depths PASS across Qwen2.5-7B, Qwen2.5-32B, and Gemma 4 31B at lengths from 4k to 96k tokens.
**Gemma 4 31B (BF16, RTX PRO 6000 × 2)**

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
|---|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 5.80x | 15.5 GB | 2.7 GB | 32.84 GiB |
| 32,000 tokens | 5/5 (100%) | 5.80x | 30.9 GB | 5.3 GB | 36.32 GiB |
| 65,000 tokens | 5/5 (100%) | 5.80x | 62.8 GB | 10.8 GB | 45.31 GiB |
| 96,000 tokens | 5/5 (100%) | 5.80x | 92.7 GB | 16.0 GB | 52.97 GiB |
**Qwen2.5-32B (BF16, RTX PRO 6000 × 2)**

| Context | Accuracy | Compression | FP16 KV | Compressed KV | Peak VRAM |
|---|---|---|---|---|---|
| 4,000 tokens | 5/5 (100%) | 5.28x | 983 MB | 186 MB | — |
| 16,000 tokens | 5/5 (100%) | 5.28x | 3.9 GB | 730 MB | — |
| 32,000 tokens | 5/5 (100%) | 5.28x | 7.7 GB | 1.5 GB | — |
| 65,000 tokens | 5/5 (100%) | 5.28x | 16.7 GB | 3.2 GB | 33.16 GiB |
| 96,000 tokens | 5/5 (100%) | 5.28x | 24.7 GB | 4.7 GB | 34.85 GiB |
**Qwen2.5-7B (BF16, single RTX 4090)**

| Context | Accuracy | Compression | Compressed KV | Peak VRAM |
|---|---|---|---|---|
| 16,000 tokens | 5/5 (100%) | 5.22x | 173 MB | 19.95 GiB |
Streaming KV applies no compression within the active turn. Each turn's KV is buffered raw (BF16) on CPU during prefill and decode; quantization to TurboQuant happens once at the turn boundary, so cosine similarity inside the active turn is > 0.999. Pinned-memory async prefetch overlaps PCIe transfer with attention compute, as sketched below. Block-allocated buffers (16-token blocks) eliminate per-step heap allocations.
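A minimal sketch of the double-buffered prefetch pattern (illustrative names, not TurboAgent's internals): while the default stream computes attention for layer `i`, a second CUDA stream copies layer `i+1`'s KV slice from pinned CPU memory.

```python
# Double-buffered pinned-memory prefetch: overlap the PCIe copy of layer
# i+1's KV slice with attention compute for layer i. Names (kv_slices_cpu,
# compute_fn) are illustrative, not TurboAgent's API.
import torch

prefetch_stream = torch.cuda.Stream()

def stream_kv_layers(kv_slices_cpu, compute_fn):
    # Pinned CPU memory enables truly asynchronous host-to-device copies.
    kv_slices_cpu = [t.pin_memory() for t in kv_slices_cpu]

    with torch.cuda.stream(prefetch_stream):
        next_gpu = kv_slices_cpu[0].to("cuda", non_blocking=True)

    outputs = []
    for i in range(len(kv_slices_cpu)):
        # Make sure slice i has finished copying before attention uses it.
        torch.cuda.current_stream().wait_stream(prefetch_stream)
        current_gpu = next_gpu
        if i + 1 < len(kv_slices_cpu):
            with torch.cuda.stream(prefetch_stream):
                next_gpu = kv_slices_cpu[i + 1].to("cuda", non_blocking=True)
        # Attention for layer i runs on the default stream while the
        # prefetch stream copies layer i+1's slice over PCIe.
        outputs.append(compute_fn(i, current_gpu))
    return outputs
```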
Auto-detects your GPU and selects NF4 weight quantization, streaming KV, and backend automatically. The hardware detector recommends NF4 when it sees a 30B+ model on a 24 GB GPU.
For Qwen2.5-32B at 65k: ~18 GB weights on GPU + ~3 GB streaming working buffers ≈ 21.5 GiB peak VRAM, with the ~3.2 GB compressed KV held in CPU RAM — it fits a single 4090.
```python
from turboagent.backends.torch import TorchEngine

engine = TorchEngine(
    model_id="Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",       # 4-bit weights
    kv_storage="cpu_streaming",   # compressed KV on CPU
)

response, metrics = engine.generate_chat(
    [{"role": "user", "content": "..."}],
    kv_cache,
)
# Peak VRAM at 65k context: ~21.5 GiB / 24 GiB
```
v1.1.0 ships the streaming KV + NF4 layers. v1.2 adds two more: BitTorch fine-tuning (QLoRA / LoRA / DPO) and FlexTensor automatic weight offloading. Together with the existing TurboQuant KV compression, this is the four-layer memory stack that unlocks Llama-3.1-70B with 200k+ context on a single RTX 4090 at ~3–8 tok/s decode.
QLoRA, LoRA, DPO, GRPO, and prefix tuning ported from BitTorch AI. Train adapters on the same NF4-quantized base model you serve — no separate training pipeline.
Profile-driven CPU↔GPU weight placement (vs. static n_gpu_layers tuning). Automatically keeps hot tensors on GPU; cold tensors stream from system RAM. Estimated 30–50% throughput gain over manual offload.
Tauri + React frontend lifted from BitTorch AI. Model download, quantization, fine-tuning, and inference in one app. Live KV-compression dashboard, multi-turn agent inspector, TurboSwarm visualization.
| Configuration | v1.1 (today) | v1.2 (planned) |
|---|---|---|
| Qwen2.5-32B / single 4090 (24 GB) | 65k context ✓ | 200k+ context |
| Llama-3.1-70B / single 4090 (24 GB) | doesn't fit | 200k+ context, 3–8 tok/s |
| Llama-3.1-70B / dual 4090 (48 GB) | NF4 weights fit; no long-context headroom | full GPU, 15–25 tok/s |
Architectural positioning vs. the closest open-source alternatives for running large models with long context on consumer hardware. We don't have head-to-head benchmark numbers across these stacks; the table reflects what each approach does and doesn't do.
| | llama.cpp | vLLM | Hugging Face Transformers | TurboAgent v1.1 |
|---|---|---|---|---|
| 4-bit weight quant | Q4_K_M (GGUF) | AWQ / GPTQ | NF4 (bnb) | NF4 (bnb) |
| KV cache compression | — | FP8 KV (2x) | — | TurboQuant 5.28–5.80x |
| CPU-side KV streaming | — | — | — | Yes (compressed, async prefetch) |
| Manual GPU/CPU split | n_gpu_layers | tensor parallel only | device_map | device_map="auto" + FlexTensor (v1.2) |
| Multi-turn agent layer | — | — | — | Yes (TurboAgent / TurboSwarm / RAG) |
| 32B at 65k on single 24 GB GPU | partial (no long-ctx) | doesn't fit | doesn't fit | 15/15 NIAH PASS |
llama.cpp is the closest functional comparison — it serves 4-bit weights on consumer GPUs via GGUF quantization. The key gap: llama.cpp doesn't compress the KV cache, so long contexts blow out VRAM regardless of weight quantization. vLLM offers FP8 KV (2x) but no CPU offload of the cache, so it needs multi-GPU for 32B+. TurboAgent's combination — NF4 weights + 5.28x compressed KV streamed from CPU — is the only Python stack we're aware of that fits 32B at 65k on a single 24 GB GPU.
What we measured, how we measured it, and what we explicitly haven't measured yet. Honesty about benchmark scope is more useful than a "zero accuracy loss" claim that doesn't survive contact with downstream tasks.
The needle is a unique code (e.g., XK-9234) embedded in a [CRITICAL INFORMATION] bracketed phrase. Generation uses temperature=0.0 (greedy), SDPA attention, num_logits_to_keep=1 for prefill, and chunked prefill at 512 tokens.
When you call TurboAgent(...) without specifying quantize_weights,
the hardware detector picks based on model size and total VRAM:
| Model size | VRAM tier | Detector recommends |
|---|---|---|
| < 30B params | any | quantize_weights=None (BF16) |
| 30B+ params | 40 GB+ | quantize_weights="nf4" |
| 30B+ params | 24 GB | quantize_weights="nf4" + kv_storage="cpu_streaming" |
| 30B+ params | 16 GB / 8–12 GB | quantize_weights="nf4" + smaller chunked prefill |
All recommendations can be overridden by passing the keyword argument explicitly. If
bitsandbytes isn't installed, NF4 falls back to BF16 with a clear log message
instead of failing silently.
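For example, a hypothetical explicit override (any detector recommendation can be replaced by passing the keyword argument yourself):

```python
from turboagent.backends.torch import TorchEngine

# Force BF16 weights even if the detector would recommend NF4.
engine = TorchEngine(
    model_id="Qwen/Qwen2.5-32B-Instruct",
    quantize_weights=None,        # explicit override of the detector
    kv_storage="cpu_streaming",
)
```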
NIAH is the standard for retrieval but it's narrow — it tests "can the model find the needle?", not "does compression preserve multi-hop reasoning, JSON-schema adherence, or agent-loop coherence?" Planned for the v1.2 evaluation: LongBench, the full RULER suite, and multi-turn agent-loop coherence tests.
Everything you need for long-context agentic AI on consumer hardware.
4-bit weights via bitsandbytes. Drops Qwen2.5-32B from 64 GB BF16 to 18 GB. quantize_weights="nf4", double-quant on, BF16 compute dtype.
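For reference, roughly what quantize_weights="nf4" corresponds to in raw Hugging Face Transformers + bitsandbytes (a sketch of the equivalent configuration, not TurboAgent code):

```python
# Equivalent NF4 setup in plain Transformers: 4-bit NF4 weights,
# double quantization, BF16 compute dtype.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```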
Compressed KV stays on CPU; only the active layer's slice is on GPU. Pinned memory + async prefetch overlap PCIe with compute. 16-token block buffers, zero per-step allocs.
5.22–5.80x KV cache compression with zero intra-turn quantization noise. Native bit-packing kernels (turboquant-kv); 1.5–2x fallback in pure PyTorch.
Specialist agents share a single compressed KV pool. Round-robin or custom routing. ~5x less memory than independent contexts.
Inner-product-preserving search using TurboQuant codebooks. Better recall than Product Quantization at the same bit budget.
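The generic idea behind codebook-based inner-product search, sketched in NumPy (illustrative only — not the turboquant-kv API): precompute query-codeword inner products once per query, then score compressed vectors with table lookups instead of full dot products.

```python
# Lookup-table scoring over codeword indices: the inner product of the query
# with a compressed vector decomposes into a sum of per-subspace table entries.
import numpy as np

def build_lut(query, codebooks):
    # codebooks: (n_subspaces, n_codes, sub_dim); split the query to match.
    subqueries = query.reshape(len(codebooks), -1)
    return np.einsum("sd,skd->sk", subqueries, codebooks)  # (n_subspaces, n_codes)

def scores(codes, lut):
    # codes: (n_vectors, n_subspaces) integer codeword indices per vector.
    return lut[np.arange(lut.shape[0]), codes].sum(axis=1)

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 256, 16))     # 8 subspaces, 256 codes each
codes = rng.integers(0, 256, size=(1000, 8))  # 1000 compressed vectors
query = rng.normal(size=128)

lut = build_lut(query, codebooks)
top5 = np.argsort(-scores(codes, lut))[:5]    # top-5 by approximate inner product
```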
FastAPI /v1/chat/completions, drop-in for any OpenAI SDK. Docker with GPU passthrough, auth, rate limiting, per-session compressed KV.
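Since the endpoint is OpenAI-compatible, the official SDK can point at it directly (host, port, and API key below are assumptions — use whatever you launched the server with):

```python
# Talking to a TurboAgent server with the stock OpenAI SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Summarize this 60k-token transcript."}],
)
print(resp.choices[0].message.content)
```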
Validated against the TurboQuant paper's theoretical bounds on real hardware.
| Mode | Bits/Value | Compression | 32B KV @ 65k | Best For |
|---|---|---|---|---|
| turbo3 | 3.25 bpv | 5.22–5.80x | ~3.2 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~4.4 GB | Higher fidelity, ample memory |
| FP16 baseline | 16 bpv | 1x | ~16.7 GB | Reference (no compression) |
Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.
```python
from turboagent.agents.swarm import TurboSwarm, SwarmAgent

swarm = TurboSwarm(
    "Qwen/Qwen2.5-32B-Instruct",
    quantize_weights="nf4",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")
```
Cross-model, cross-hardware validation with production models.
| Model | Hardware | VRAM | Max Context | NIAH Accuracy |
|---|---|---|---|---|
| Qwen2.5-32B (NF4) | Single RTX 4090 | 24 GB | 65,000 tokens | 15/15 (100%) |
| Qwen2.5-32B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 25/25 (100%) |
| Gemma 4 31B (BF16) | RTX PRO 6000 × 2 | 192 GB | 96,000 tokens | 20/20 (100%) |
| Qwen2.5-7B (BF16) | Single RTX 4090 | 24 GB | 16,000 tokens | 5/5 (100%) |
Total: 65/65 NIAH depths PASS across 4 model/hardware configurations. Combined Streaming KV + NF4 path enables the "32B on 24 GB" headline result.
SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.
The open-source core is free forever under the MIT license.