TurboAgent brings Google Research's TurboQuant KV-cache compression to open-source LLMs. 4.9x memory reduction, zero accuracy loss, one-line setup.
Auto-detects your hardware, selects the optimal backend, configures TurboQuant compression, and manages multi-turn memory. Just import and run.
KV cache for 70B at 128k context: ~4 GB instead of ~20 GB.
```python
from turboagent import TurboAgent

agent = TurboAgent(
    "meta-llama/Llama-3.1-70B-Instruct",
    kv_mode="turbo3",
    context=131072,
)

response = agent.run(
    "Analyze my 50k-token research doc..."
)
# KV cache: 4 GB (was 20 GB)
```
Everything you need for long-context agentic AI on consumer hardware.
PolarQuant + QJL algorithms from the paper. Precomputed Lloyd-Max codebooks. Bit-packed storage for true 4.9x compression.
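Bit-packing is what turns sub-4-bit codes into real memory savings. The sketch below is illustrative only (not TurboAgent's actual kernels, which run on GPU): it packs 3-bit quantizer codes LSB-first into a contiguous byte buffer, so eight values occupy 3 bytes instead of 16 bytes at FP16.

```python
# Illustrative sketch (not TurboAgent's real storage code): packing
# 3-bit quantizer codes into bytes, the idea behind bit-packed storage.

def pack_3bit(codes):
    """Pack a sequence of 3-bit codes (0-7) into bytes, LSB-first."""
    buf = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # bits currently held in the accumulator
    for c in codes:
        acc |= (c & 0b111) << nbits
        nbits += 3
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)  # flush any trailing bits
    return bytes(buf)

def unpack_3bit(buf, n):
    """Recover n 3-bit codes from a packed buffer."""
    codes, acc, nbits = [], 0, 0
    it = iter(buf)
    while len(codes) < n:
        if nbits < 3:
            acc |= next(it) << nbits
            nbits += 8
        codes.append(acc & 0b111)
        acc >>= 3
        nbits -= 3
    return codes

codes = [5, 0, 7, 3, 1, 6, 2, 4]
packed = pack_3bit(codes)           # 8 values -> 3 bytes (vs 16 at FP16)
assert unpack_3bit(packed, len(codes)) == codes
```

The quoted 3.25 bpv is slightly above 3 bits because per-block metadata (e.g. codebook scales) is stored alongside the packed codes.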
Detects CUDA, ROCm, Metal, CPU. Selects backend, layer offloading, context size, and compression mode automatically.
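A minimal sketch of what such auto-detection can look like, using only the standard library; the probe heuristics and the `detect_backend` helper are hypothetical, not TurboAgent's actual logic.

```python
# Hypothetical hardware probe; TurboAgent's real detection may differ.
import platform
import shutil

def detect_backend():
    """Return a best-guess compute backend for this machine."""
    if shutil.which("nvidia-smi"):      # NVIDIA driver tools -> CUDA
        return "cuda"
    if shutil.which("rocm-smi"):        # AMD ROCm stack present
        return "rocm"
    if platform.system() == "Darwin":   # Apple Silicon -> Metal
        return "metal"
    return "cpu"                        # safe fallback

print(detect_backend())
```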
llama.cpp for consumer GPUs, vLLM for server throughput, PyTorch for research. One unified API.
Shared compressed KV pool across agents. Round-robin or custom routing. 5x less memory than independent contexts.
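The round-robin policy can be pictured as a simple ring over the agent pool; this standalone sketch (agent names and the `RoundRobinRouter` class are illustrative, not TurboAgent's API) shows how requests cycle across agents that share one KV pool.

```python
# Illustrative round-robin router over agents sharing a KV pool.
from itertools import cycle

class RoundRobinRouter:
    def __init__(self, agents):
        self._ring = cycle(agents)  # endless iterator over the pool

    def route(self, request):
        """Assign the next agent in the ring to this request."""
        return next(self._ring), request

router = RoundRobinRouter(["researcher", "critic", "writer"])
assigned = [router.route(f"task-{i}")[0] for i in range(4)]
print(assigned)  # ['researcher', 'critic', 'writer', 'researcher']
```

A custom router would replace `cycle` with any policy (load, role match, priority) while the compressed KV pool stays shared underneath.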
Inner-product-preserving search using TurboQuant codebooks. Superior recall vs. Product Quantization.
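The key property is asymmetric search: the query stays in full precision while stored vectors are quantized, and the inner product is computed against dequantized codes. The toy below uses a uniform 3-bit scalar quantizer as a stand-in for TurboQuant's Lloyd-Max codebooks, just to show the approximate dot product tracking the exact one.

```python
# Toy inner-product-preserving quantization (uniform quantizer as a
# stand-in for Lloyd-Max codebooks; not TurboAgent's implementation).

def quantize(vec, bits=3):
    levels = 2 ** bits - 1                 # 7 steps for 3 bits
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels or 1.0      # guard constant vectors
    codes = [round((v - lo) / scale) for v in vec]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

def approx_dot(query, codes, scale, lo):
    """Full-precision query vs. quantized stored vector."""
    return sum(q * v for q, v in zip(query, dequantize(codes, scale, lo)))

stored = [0.12, -0.40, 0.85, 0.33]
query = [0.50, 0.20, -0.10, 0.90]
exact = sum(q * v for q, v in zip(query, stored))
approx = approx_dot(query, *quantize(stored))
print(exact, approx)  # approx stays within the quantization step of exact
```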
OpenAI-compatible /v1/chat/completions. Docker deployment with GPU passthrough. Auth + rate limiting built in.
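A sketch of what a request to that endpoint looks like, built with only the standard library. The `/v1/chat/completions` path comes from the text above; the host, port, and API key are placeholders for whatever your deployment uses.

```python
# Building (not sending) a request to the OpenAI-compatible endpoint.
# localhost:8000 and the bearer token are placeholder assumptions.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a long-context analyst."},
        {"role": "user", "content": "Summarize the attached report."},
    ],
    "max_tokens": 512,
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-key>",
    },
)
# urllib.request.urlopen(req) would send it to a running server
print(req.get_method())  # POST
```

Because the schema matches OpenAI's, existing OpenAI client SDKs should work by pointing their base URL at your deployment.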
Validated against the paper's theoretical bounds on real hardware.
| Mode | Bits/Value | Compression | 70B KV @ 128k | Best For |
|---|---|---|---|---|
| turbo3 | 3.25 bpv | 4.9x | ~4 GB | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | ~5.3 GB | Higher quality, ample memory |
| FP16 baseline | 16 bpv | 1x | ~20 GB | Reference (no compression) |
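The table's numbers follow from one ratio: FP16 bits over bits-per-value gives the compression factor, and the baseline cache size scales down by it. A quick check:

```python
# Verifying the mode table: ratio = 16 / bpv, size = baseline / ratio.
fp16_bits = 16
baseline_gb = 20  # 70B KV cache at 128k context, per the table

for mode, bpv in [("turbo3", 3.25), ("turbo4", 4.25)]:
    ratio = fp16_bits / bpv
    print(f"{mode}: {ratio:.1f}x -> ~{baseline_gb / ratio:.1f} GB")
# turbo3: 4.9x -> ~4.1 GB
# turbo4: 3.8x -> ~5.3 GB
```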
Specialist agents share a single compressed KV cache. Agent B attends to context generated by Agent A without re-encoding. TurboQuant's inner-product fidelity ensures attention accuracy across agent boundaries.
```python
from turboagent.agents.swarm import (
    TurboSwarm,
    SwarmAgent,
)

swarm = TurboSwarm(
    "meta-llama/Llama-3.1-70B-Instruct",
    agents=[
        SwarmAgent(name="researcher", role="research"),
        SwarmAgent(name="critic", role="review"),
        SwarmAgent(name="writer", role="writing"),
    ],
)

results = swarm.run("Analyze KV cache compression.")
```
Tested on real hardware with production models.
| Model | GPU | VRAM | Compression | Multi-Turn Recall |
|---|---|---|---|---|
| Qwen2.5-32B | RTX PRO 6000 | 96 GB | 5.28x | Recalls facts after compress/decompress |
| Qwen2-0.5B | RTX 4070 | 8 GB | 4.9x | Bridge verified (quality limited by model size) |
SSO, audit logging, compliance exports (SOC-2, GDPR), governance policies, multi-node KV sharing, priority kernels, dedicated support.
The open-source core is free forever under the MIT license.