Everpure says TurboQuant turns KV cache into a storage problem

Chris Mellor, Blocks & Files editor

Published Fri 10 Apr 2026 // 16:00 UTC

Everpure blogger Robert Alvarez says that combining FlashBlade with TurboQuant compression means "the KV cache is no longer a memory capacity problem. It’s a storage I/O problem, and storage I/O is a problem Everpure knows how to solve."

TurboQuant, described in a Google research paper, is a mathematical method for near-lossless 5x compression of the vector data in a GPU's KV cache, which resides in the GPU's high-bandwidth memory (HBM).

Alvarez has a PhD in applied mathematics and is a Senior AI Solutions Architect at Everpure. He has written two TurboQuant-related blogs: "TurboQuant Compresses KV Cache by 5X. Does That Mean You Need Less Memory?", with Everpure's Jean-Baptiste Thomas, and "Up to 10X Faster KV Cache Restore: TurboQuant Meets FlashBlade."

He lays out the basic problem: "Running a 70B parameter model for 128 concurrent users at 32K context consumes roughly 2.6 TB of KV cache memory. That's HBM on your GPUs, the scarcest and most expensive resource in your inference stack. An H100 gives you 80 GB of it. Even the B200, Nvidia's current flagship, tops out at 192 GB."

If the KV cache data can be compressed, more will fit in HBM. A 5x compression of 2.6 TB turns it into 520 GB, still more than a single B200's 192 GB, but now you need 3 x B200s instead of 14 to hold the data in HBM (or 7 x H100s instead of 33).
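The GPU counts follow from dividing the cache size by per-GPU HBM capacity and rounding up. The short sketch below checks that arithmetic. It assumes decimal units and, unrealistically, that HBM holds nothing but KV cache (model weights and activations are ignored); `gpus_needed` is an illustrative helper name, not something from Alvarez's blog.

```python
import math

KV_CACHE_GB = 2600                   # 2.6 TB from the quoted example
COMPRESSION = 5                      # TurboQuant's headline ratio
HBM_GB = {"H100": 80, "B200": 192}   # capacities quoted by Alvarez


def gpus_needed(cache_gb, hbm_gb):
    """GPUs required if HBM held only KV cache (ignores weights, activations)."""
    return math.ceil(cache_gb / hbm_gb)


for name, hbm in HBM_GB.items():
    before = gpus_needed(KV_CACHE_GB, hbm)
    after = gpus_needed(KV_CACHE_GB / COMPRESSION, hbm)
    print(f"{name}: {before} GPUs uncompressed, {after} GPUs at {COMPRESSION}x")
```

In practice the weights of a 70B model also occupy HBM, so real deployments would need more GPUs than these lower bounds suggest.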

Referring to a diagram (below), Alvarez describes the main part of TurboQuant's operation like this: "The TurboQuant pipeline in four steps: Start with raw 128-dimensional vectors (1), extract magnitude and apply a random orthogonal rotation (2), quantize each coordinate to one of eight levels (3 bits) using the concentrated post-rotation distribution (3), then reverse the process to reconstruct (4). Original (blue) vs. reconstructed (red) shows cosine similarity of 0.94 on synthetic data; real model KV tensors achieve higher fidelity."
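The four quoted steps can be illustrated with a toy sketch in plain Python. This is not Google's or Everpure's implementation. It makes two loud assumptions: the random orthogonal rotation is stood in for by random sign flips followed by a Walsh-Hadamard transform, and the eight reconstruction levels are approximate Lloyd-Max values for a unit Gaussian. All function names are hypothetical.

```python
import math
import random

# Assumption: approximate Lloyd-Max centroids for a unit-variance Gaussian,
# 8 levels (3 bits). The real codebook used by TurboQuant may differ.
LEVELS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of 2."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def compress(vector, signs):
    """Steps 2-3: split off the magnitude, rotate, quantize to 3-bit codes."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return 0.0, [0] * len(vector)
    unit = [v / norm for v in vector]
    rotated = fwht([s * u for s, u in zip(signs, unit)])
    scale = math.sqrt(len(vector))  # rotated coordinates have variance ~1/d
    codes = [min(range(8), key=lambda k: abs(LEVELS[k] - x * scale))
             for x in rotated]
    return norm, codes


def decompress(norm, codes, signs):
    """Step 4: look up levels, undo the rotation, restore the magnitude."""
    scale = math.sqrt(len(codes))
    rotated = [LEVELS[c] / scale for c in codes]
    unit = [s * v for s, v in zip(signs, fwht(rotated))]
    return [norm * u for u in unit]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
dim = 128                                                # step 1: a raw vector
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
original = [rng.gauss(0.0, 1.0) for _ in range(dim)]
norm, codes = compress(original, signs)
restored = decompress(norm, codes, signs)
print(f"cosine similarity: {cosine(original, restored):.3f}")
```

The rotation matters because it spreads a vector's energy evenly across coordinates, so a single fixed eight-level codebook fits every coordinate reasonably well. The similarity this toy prints on Gaussian test data should not be compared with the 0.94 in the quote, which comes from Everpure's own synthetic data and pipeline.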

Everpure TurboQuant process diagram.

Alvarez says: "If the KV cache reservation per user drops by 3-5x, that's HBM you get back. And that headroom opens three doors that were previously closed. First, you can run bigger models on the same GPUs... Second, compressed caches are faster to move. A 1GB compressed cache transfers between nodes in your AI factory in a fraction of the time a 4.6 GB uncompressed one does... Third, and this is the Jevons Paradox point, easing the per-session HBM requirement doesn't reduce total memory demand. It increases it because workloads that were previously too expensive to run become feasible."

FlashBlade comes in as a storage tier for KV cache data evicted from HBM. Alvarez's blog summary reads: "TurboQuant on FlashBlade can deliver up to 10X faster KV cache restores with 5X compression, enabling scalable LLM inference for long-context AI workloads." A chart showing average restore times across eight concurrent GPUs illustrates this:

FlashBlade TurboQuant chart.

Alvarez says: "The benchmarks ran on an Nvidia DGX system with eight A100-40GB GPUs connected to an Everpure FlashBlade over NFS with RDMA. The model was Qwen2.5-7B-Instruct, which has a head dimension of 128, the dimensionality that matters for TurboQuant because the concentration of measure phenomenon that makes the algorithm work gets stronger as dimension increases. The KV cache sizes scale linearly with context length at a consistent 4.9X reduction."

He adds: "Nvidia has stated that its AI factory architecture requires 16 TB of KV cache storage per GPU... That requirement was defined before TurboQuant existed. At 3-bit compression, that 16 TB per GPU drops to 3.3 TB. Multiply across a 1,000-GPU cluster and TurboQuant takes the storage footprint from 16 petabytes to 3.3 petabytes."
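A rough sizing sketch shows why the cache grows linearly with context length and where a ratio close to 4.9X could come from. The layer and KV-head counts below are assumed values for Qwen2.5-7B-Instruct (the article gives only the head dimension of 128). The 16-bit baseline and the single 32-bit magnitude stored per vector are likewise assumptions, not details taken from Alvarez's benchmark.

```python
# Assumed model shape; only HEAD_DIM = 128 is stated in the article.
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128

BASELINE_BITS = 16   # assumed FP16/BF16 cache entries
QUANT_BITS = 3       # eight levels per coordinate
NORM_BITS = 32       # assumed: one FP32 magnitude kept per vector

QUOTED_RATIO = 4.9   # reduction reported in the benchmark
PER_GPU_TB = 16      # Nvidia's stated per-GPU KV cache storage figure
CLUSTER_GPUS = 1000


def vectors_per_token():
    """One key vector and one value vector per layer per KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS


def raw_bytes(context_tokens):
    return context_tokens * vectors_per_token() * HEAD_DIM * BASELINE_BITS // 8


def compressed_bytes(context_tokens):
    bits_per_vector = HEAD_DIM * QUANT_BITS + NORM_BITS
    return context_tokens * vectors_per_token() * bits_per_vector // 8


for tokens in (8_192, 32_768, 131_072):
    raw, packed = raw_bytes(tokens), compressed_bytes(tokens)
    print(f"{tokens:>7} tokens: {raw / 1e9:6.2f} GB -> {packed / 1e9:5.2f} GB "
          f"({raw / packed:.2f}x)")

# Storage arithmetic from the quote, using the reported 4.9x ratio.
per_gpu_tb = PER_GPU_TB / QUOTED_RATIO
print(f"per GPU: {PER_GPU_TB} TB -> {per_gpu_tb:.1f} TB")
print(f"{CLUSTER_GPUS} GPUs: {PER_GPU_TB * CLUSTER_GPUS / 1000:.0f} PB -> "
      f"{per_gpu_tb * CLUSTER_GPUS / 1000:.1f} PB")
```

Under these assumptions the ratio is the same at every context length, because both the raw and compressed sizes are a fixed number of bytes per token. A pure 16-bit to 3-bit conversion would give 5.3x; per-vector metadata such as the stored magnitude is one plausible reason the measured figure is a little lower.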

We think DDN, VAST Data, WEKA, and Nvidia's other CMX partners will certainly be checking this out.
