Flash

Everpure says TurboQuant turns KV cache into a storage problem

Published

Everpure blogger Robert Alvarez says FlashBlade and the TurboQuant compression means "the KV cache is no longer a memory capacity problem. It’s a storage I/O problem, and storage I/O is a problem Everpure knows how to solve."

TurboQuant is described in a Google research paper, and is a mathematical way of near-losslessly compressing the vector data in a GPU's KV cache, which resides in its high-bandwidth memory (HBM), by 5x.

Alvarez has a PhD in applied mathematics and is a Senior AI Solutions Architect at Everpure. He has written two TurboQuant-related blogs, TurboQuant Compresses KV Cache by 5X. Does That Mean You Need Less Memory? with Everpure's Jean-Baptiste Thomas, and Up to 10X Faster KV Cache Restore: TurboQuant Meets FlashBlade.

Robert Alvarez
Robert Alvarez

He lays out the basic problem: "Running a 70B parameter model for 128 concurrent users at 32K context consumes roughly 2.6 TB of KV cache memory. That's HBM on your GPUs, the scarcest and most expensive resource in your inference stack. An H100 gives you 80 GB of it. Even the B200, Nvidia's current flagship, tops out at 192 GB."

If the KV cache data can be compressed, more will fit in HBM. A 5x compression of 2.6 TB turns it into 520 GB, still more than a B200's 192 GB, but now you need 3 x H100s instead of 14 x H100s to hold the data in HBM. 

Referring to a diagram (below), Alvarez describes the main part of TurboQuant's operation like this: "The TurboQuant pipeline in four steps: Start with raw 128-dimensional vectors (1), extract magnitude and apply a random orthogonal rotation (2), quantize each coordinate to one of eight levels (3 bits) using the concentrated post-rotation distribution (3), then reverse the process to reconstruct (4). Original (blue) vs. reconstructed (red) shows cosine similarity of 0.94 on synthetic data; real model KV tensors achieve higher fidelity."

Everpure TurboQuant process diagram.
Everpure TurboQuant process diagram

Alvarez says: "If the KV cache reservation per user drops by 3-5x, that's HBM you get back. And that headroom opens three doors that were previously closed. First, you can run bigger models on the same GPUs... Second, compressed caches are faster to move. A 1GB compressed cache transfers between nodes in your AI factory in a fraction of the time a 4.6 GB uncompressed one does... Third, and this is the Jevons Paradox point, easing the per-session HBM requirement doesn't reduce total memory demand. It increases it because workloads that were previously too expensive to run become feasible."

FlashBlade comes in as an evicted KV cache data storage resource. Alvarez's blog summary reads: "TurboQuant on FlashBlade can deliver up to 10X faster KV cache restores with 5X compression, enabling scalable LLM inference for long-context AI workloads." A chart showing average restore times across eight concurrent GPUs illustrates this:

FlashBlade TurboQuant chart.
FlashBlade TurboQuant chart

Alvarez says: "The benchmarks ran on an Nvidia DGX system with eight A100-40GB GPUs connected to an Everpure FlashBlade over NFS with RDMA. The model was Qwen2.5-7B-Instruct, which has a head dimension of 128, the dimensionality that matters for TurboQuant because the concentration of measure phenomenon that makes the algorithm work gets stronger as dimension increases. The KV cache sizes scale linearly with context length at a consistent 4.9X reduction."

"Nvidia has stated that its AI factory architecture requires 16 TB of KV cache storage per GPU... That requirement was defined before TurboQuant existed. At 3-bit compression, that 16 TB per GPU drops to 3.3 TB. Multiply across a 1,000-GPU cluster and TurboQuant takes the storage footprint from 16 petabytes to 3.3 petabytes."

We think DDN, VAST Data, WEKA, and Nvidia's other CMX partners will certainly be checking this out.