TurboQuant: A Little Hope While Big Tech Companies Go Berserk on AI

Dex-chan lover
Joined
Sep 17, 2025
Messages
150
I don't know whether people on this forum have already heard this news or not, but the AI bubble that has been sending RAM prices skyrocketing has already started deflating a little.

I consider this thread a continuation of this past thread. Or not? Since that past thread was created back in late 2024, there's probably no major relation, but since the topic is the same, I'll treat this one as the continuation.

Ahem, let's get into it. So, what's the news?

TurboQuant

About 2-3 weeks ago, the Google Research team introduced a data processing algorithm that is claimed to resolve a major bottleneck in current AI research: the Key-Value Cache (KV Cache).

What is KV Cache?

The KV Cache is a method used by LLMs (such as ChatGPT, Gemini, Claude, etc.) to remember context. The chat history and embedded files are saved as a cache in GPU VRAM so that the model doesn't need to recalculate the response (including its understanding of the context) from scratch every time. Put simply, this is the AI's way of remembering the topic, so the chat can flow easily and feel more human.
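To picture how this cache behaves, here's a toy numpy sketch of single-head attention with a growing cache. The dimensions and random vectors are purely illustrative, not from any real model:

```python
import numpy as np

d = 64  # head dimension (illustrative only)

def attend(q, K, V):
    """Single-head attention: one new query against all cached keys/values."""
    scores = K @ q / np.sqrt(d)       # inner product with every past key
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax over the whole history
    return w @ V                      # weighted sum of cached values

# The KV cache: one (key, value) row appended per generated token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

rng = np.random.default_rng(0)
for step in range(1000):              # a 1000-token "conversation"
    k, v, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k])  # the cache only ever grows...
    V_cache = np.vstack([V_cache, v])
    _ = attend(q, K_cache, V_cache)

# Memory grows linearly with context length: 2 * tokens * d floats.
print(K_cache.nbytes + V_cache.nbytes)  # 1024000 bytes at float64
```

The point of the sketch: nothing is ever evicted, so a longer chat means a strictly larger cache, and in a real model this is multiplied across every layer and attention head.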

The problem lies here. The longer the chat goes on, and the more numerous and complex the embedded files, the larger the KV Cache grows, consuming more and more GPU memory. If you read a lot of comic chapters in your phone's browser, the image cache grows larger too, but those image caches live in internal storage, not the phone's RAM. The KV Cache, on the other hand, lives in RAM and on the GPU. Your AI will get slower and slower, or at worst crash from running out of memory.

That's the reason your browser, or even your computer, starts slowing down when a chat has gone on too long or holds too much data. I've experienced it myself on ChatGPT in 2024/2025 while consulting it about my college assignment (about coding); I asked questions and sent code so often that just opening the chat would slow down my computer. At the time, I worked around it by opening a new chat. Of course, all the context got reset and I had to explain everything from the beginning again.

So, what is the Google Research team's solution for this? The answer is TurboQuant. It is claimed to compress the KV Cache by around 4-5 times, so the model can remember longer context and more complicated, more numerous embedded files, while the GPU and RAM save more memory.


How Does TurboQuant Work?

TurboQuant works through the step-by-step approach below.

Random Rotation

First, they randomly rotate the input vectors using the Johnson-Lindenstrauss transform concept. This makes the data uniformly distributed, inducing a Beta distribution on each coordinate. The rotation also makes the coordinates almost independent and nearly uncorrelated. To put it simply, imagine we have a mess of tangled ropes. With this "random rotation" technique, we can separate the ropes into uniform lengths with no tangles left.
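As a rough illustration of why the rotation helps: a toy sketch using a QR-based random rotation as a stand-in (the paper's actual fast transform would be something structured like a randomized Hadamard, not a dense matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A random rotation: orthonormalize a Gaussian matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# A "tangled" vector: nearly all of its energy sits in one coordinate.
x = np.zeros(d)
x[0] = 10.0
x[1:] = rng.normal(scale=0.1, size=d - 1)

y = Q @ x  # the rotated vector

print(np.abs(x).max())  # one huge outlier coordinate before rotation
print(np.abs(y).max())  # after rotation, the energy is spread out evenly
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True: rotation preserves length
```

A single outlier coordinate would force a scalar quantizer to waste its whole range on one value; after rotation, every coordinate looks statistically similar, which is exactly what the next step wants.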

Optimal Scalar Quantization

The uncorrelated, uniform data, or the "untangled ropes", is much easier to quantize (mapping high-precision decimal numbers to much smaller discrete symbols or numbers) with a standard scheme. Back to the rope example: it's much easier to cut a rope into shorter pieces once it's no longer tangled, right? That's what happens here.
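A minimal sketch of plain uniform scalar quantization, my own toy example rather than TurboQuant's exact quantizer:

```python
import numpy as np

def quantize(x, bits):
    """Uniform scalar quantization: snap floats onto 2**bits evenly spaced levels."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)  # this is what gets stored
    return codes * scale + lo                            # dequantized approximation

rng = np.random.default_rng(0)
x = rng.normal(size=4096)  # stand-in for one rotated KV-cache vector

errs = {}
for bits in (2, 4, 8):
    errs[bits] = np.abs(quantize(x, bits) - x).max()
    print(bits, "bits -> worst-case error", round(errs[bits], 4))
# Each extra bit doubles the number of levels, roughly halving the error.
```

Note the trade-off this makes concrete: 4-bit codes are 4x smaller than FP16, but every value gets nudged slightly, which is the "bias" the next step exists to fix.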

But here a problem arises. Standard compression methods make the data slightly inaccurate, causing a "bias" when the AI calculates its attention scores (inner products). Attention scores are a vital point for an LLM, letting the user have an in-depth discussion with the model without constantly having to re-explain the context.

So, this is where the third step comes in.

High-Precision Bias Correction (Using the QJL Approach)

The Quantized Johnson-Lindenstrauss (QJL) algorithm was recently proposed in this data compression area. It is optimal for quantizing down to a bit-width of one. It keeps the data from the second step from being biased, and lets the result approach the Shannon Lower Bound precisely as the bit budget increases. This helps the model produce unbiased results even from heavily compressed data. Here's the visualization.

Visualization of the quantized data at 1, 2, 3, 4, and 8 bits:
screenshot-69.png
screenshot-70.png
screenshot-71.png
screenshot-72.png
screenshot-73.png

The grey wave visualizes the output target predicted by the Shannon Lower Bound, and the blue wave is the compressed Gaussian data quantized using QJL. As we can see, the more bits we use, the smoother the blue wave becomes, which means it approaches the predicted output target more closely. This is also visualized in the figure below.
screenshot-74.png
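For the curious, the one-bit idea behind QJL can be sketched very roughly like this. It's a simplified toy version, not the paper's exact algorithm; it leans on the known identity E[sign(s·k)(s·q)] = sqrt(2/pi)·⟨k,q⟩/‖k‖ for a Gaussian vector s, which is what makes the estimate unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 4096                # vector dimension, number of random projections

S = rng.normal(size=(m, d))    # shared Gaussian projection matrix

def qjl_encode(k):
    """Compress a key to 1 bit per projection (just its sign) plus one norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner(code, q):
    """Estimate <k, q> from the 1-bit code. The sqrt(pi/2) factor corrects
    the bias, via E[sign(s.k) * (s.q)] = sqrt(2/pi) * <k,q> / ||k||."""
    signs, k_norm = code
    return k_norm * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))

k, q = rng.normal(size=(2, d))
print(qjl_inner(qjl_encode(k), q), k @ q)  # estimate vs. true inner product
```

The key point: each stored key costs 1 bit per projection instead of 16, yet the inner product (the attention score) is estimated without systematic bias, only with noise that shrinks as more projections are used.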

Why Is TurboQuant A Game Changer?

Before TurboQuant, there was already research on compressing (quantizing) the KV Cache, but LLMs are really sensitive to changes in the direction of their data vectors. Improper compression makes the model lose its focus (attention scores) and suddenly become "stupid" again, or start hallucinating. TurboQuant solves this problem with the steps explained above. It can guarantee that even when the input data is compressed aggressively, the model still produces unbiased results that stay close to the target.

TLDR Version

In summary, TurboQuant is really effective at compressing the input data without making the model lose its attention scores or start hallucinating. So the model can remember longer chat history along with more complex and more numerous data.

The Effect

Even though this paper still needs to be proven in real-world testing, it has significantly affected the AI bubble riding the AI research trend. Several hours after the paper was published and Google announced it, the stocks of several major tech companies that invest heavily in AI hardware, such as Micron Technology, Nvidia, Samsung, etc., reportedly started tumbling. While not severe enough to bankrupt anyone, it affected them significantly and helped the market price correction along. More importantly, GPU prices are reportedly starting to correct slowly. That, of course, is good news for us regular gaming enthusiasts.

Closing

Hopefully the algorithm really can make AI consume less memory as claimed, so that this GPU price correction keeps making good progress for public demand. Also, hopefully AI researchers keep inventing more and more helpful algorithms, so we can enjoy our hobby (playing games) without having to put up a fight against the big companies first.


Further links: Google Research Blog | ArXiv Paper
 
Dex-chan lover
Joined
Jan 25, 2023
Messages
11,762
Nah, they won't solve it. Their core hypothesis is wrong after all. If you keep going in the wrong direction, you'll just go round in circles and then fall into a pit.
 
Dex-chan lover
Joined
Jan 25, 2023
Messages
11,762
I just hope the AI craze stops sooner or later. I know we can never go back, but I really dislike AI being shoved down my throat
I hate it just as much, but realistically speaking, the elites still have a lot of companies to crash. There are no signs of people stopping using this platform.

It is not learning if this thing cannot even correct itself. I call it "The Matrix".
 
Double-page supporter
Joined
Aug 1, 2019
Messages
55
Just to throw in some detail for the one or two people glossing over this thread, since I already had the snippets lying around.

Normally, AI model weights are stored in FP16 (16-bit). Imagine every single "thought" in the model's head is written out as a full-precision 16-bit number. It's precise, but it's a storage nightmare. It's like trying to text your friend using only high-resolution 4K images of handwritten letters. It's beautiful, but your data plan (VRAM) is going to die in five minutes.

Standard quantization (like 4-bit) says: "Hey, we don't need 16 bits. Let's just use 4." It rounds the numbers off. But if you round off the wrong numbers, the model gets stupid. It starts forgetting how to code or thinks 2+2 is 5.
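A quick toy demo of that rounding, with a made-up weight value just to show the idea:

```python
import numpy as np

w = np.float16(0.3719)            # one "thought" stored at 16-bit precision

# A 4-bit uniform grid over [-1, 1]: only 2**4 = 16 representable values.
grid = np.linspace(-1, 1, 16)
w4 = grid[np.abs(grid - float(w)).argmin()]  # round to the nearest level

print(float(w))                    # the (FP16-rounded) original value
print(w4)                          # the nearest of the 16 coarse levels
print(abs(float(w) - w4))          # the rounding error the model lives with
```

Multiply that little error by billions of weights and you see why rounding the *wrong* ones is dangerous.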

So what makes TurboQuant... Turbo? It uses hardware-aware model compression. This has 4 relevant parts:

  • Outlier Preservation: In any AI model, 99% of the weights are background noise, but 1% are "Main Characters" that hold all the logic. TurboQuant identifies these "Main Character" weights and keeps them in high resolution (FP16), while crushing the background noise down to 2-bit or 4-bit.
  • Kernel Fusion: Usually, your GPU has to decompress the weight, do the math, and then move to the next one. TurboQuant "fuses" these steps. It does the math while the data is still compressed inside the GPU’s fast memory (SRAM). No unpacking needed.
  • The 4090 Special: It uses specialized instructions on modern NVIDIA cards (Tensor Cores) that are physically built to do "low-precision math" at lightning speed. This is why you get 10-15 tokens/s on a Gemma 4 26B instead of the model just crashing your PC.
  • KV Cache Shrinkage: It doesn't just compress the model; it compresses the conversation history. This lets you have massive 128k context windows without the model "forgetting" the start of the chat or eating your VRAM for breakfast.
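The first bullet, outlier preservation, can be sketched in a few lines. This is my own toy illustration of the idea, not TurboQuant's actual kernel:

```python
import numpy as np

def mixed_precision_quantize(w, outlier_frac=0.01, bits=4):
    """Keep the largest-magnitude 1% of weights exact (the "Main Characters"),
    crush the other 99% onto a coarse 2**bits-level grid (the background noise)."""
    k = max(1, int(len(w) * outlier_frac))
    keep = np.argsort(np.abs(w))[-k:]      # indices of the biggest weights
    out = w.copy()
    mask = np.ones(len(w), dtype=bool)
    mask[keep] = False                     # everything except the outliers
    lo, hi = out[mask].min(), out[mask].max()
    scale = (hi - lo) / (2 ** bits - 1)
    out[mask] = np.round((out[mask] - lo) / scale) * scale + lo
    return out

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=10_000)    # lots of tiny "background noise" weights
w[:10] = 5.0                               # a few huge "Main Character" weights

wq = mixed_precision_quantize(w)
print(np.array_equal(wq[:10], w[:10]))     # True: the big weights survive intact
print(np.abs(wq - w).max())                # tiny: only the noise got rounded
```

Two wins at once: the big logic-carrying weights are untouched, and keeping them out of the quantization range means the coarse grid stays fine-grained for everything else.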
 
Dex-chan lover
Joined
Sep 17, 2025
Messages
150
Wow, wasn't expecting someone to reply to this thread again lol.
Nah they won't solve it. Their core hypothesis is wrong after all. If you continue going to the wrong direction you will just go roundabout then fall into the pit.
Interesting. Can you elaborate on why you say that? I'm really curious :huh:
Just to throw in some detail for the one or two people glossing over this thread, since I already had the snippets laying around.
Thank you for providing more detailed context with simplified explanations hehe. I'm not really able to explain things simply, so your explanation really helps me a lot.
 
Dex-chan lover
Joined
Jan 25, 2023
Messages
11,762
Quoting a true expert Latinist from his response to a proposal for building a Latin-learning AI model:

"If it cannot learn Latin from the sources readily available in the Internet, then it cannot understand Latin." Emphasis on cannot.

This applies also to simple things like cooking or even sweeping.
 
Dex-chan lover
Joined
Jan 25, 2023
Messages
11,762
Quoting a true expert Latinist from his response to a proposal for building a Latin-learning AI model:

"If it cannot learn Latin from the sources readily available in the Internet, then it cannot understand Latin." Emphasis on cannot.

This also applies to simple things like cooking or even washing your socks.

It will just be autocomplete in the end, used for the wrong ends.
 
Dex-chan lover
Joined
Jan 25, 2023
Messages
11,762
The effect shows that AI use caused four big tech companies to almost crash, and their solution is to fire employees, because of the CEOs' and directors' belief in the sunk cost fallacy. I see that even the AI essay for "The Effect" had the sad case of bad grammar. Truly written like something similar to an ETL.

Way to go. Those fired people had better start a new company and a new currency, so we can watch the elites beg for talent on the street.
 
