Dex-chan lover
- Joined
- Sep 17, 2025
- Messages
- 106
I don't know does people on this forum already know this news or not, but the AI bubble that was skyrocketing RAM prices already start popping out a little.
I consider this thread a continuation of this past thread. Or not? Since that past thread was created back in late 2024, there may be no major relation, but since the topic is the same, I'll treat this one as its continuation.
Ahem, let's continue. So, what's the news?
TurboQuant
About 2-3 weeks ago, the Google Research team introduced a data-processing algorithm that they claim resolves the AI bottleneck highlighted in recent research: the Key-Value Cache (KV Cache).
What is KV Cache?
KV Cache is a method used by LLM models (such as ChatGPT, Gemini, Claude, etc.) to remember context. The chat history and embedded files are saved as a cache in the GPU's VRAM so that the model doesn't need to recalculate its response (including its understanding of the context) from the very beginning. Put simply, this is the AI's way of remembering the topic so the chat can flow easily and feel more human.

The problem lies here. The longer the chat goes on, and the more numerous and complex the embedded files, the larger the KV Cache grows, consuming more and more GPU memory. If you read a lot of comic chapters in your phone's browser, the cache grows too, but those image caches live in internal storage, not in the phone's RAM. The KV Cache, however, lives in RAM and GPU memory. Your AI grows slower, or at worst crashes from running out of memory.
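As a toy sketch of the idea (all names and sizes here are made up for illustration; this is not any real model's implementation), a KV cache simply stores each token's projected key and value once, so every later step only attends over the stored rows instead of re-projecting the whole history:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (toy size; real models use 64-128)

# Hypothetical projection weights for one attention head.
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

kv_cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

def append_token(x):
    """Project the new token ONCE and append it to the cache,
    instead of re-projecting the whole history every step."""
    kv_cache["K"] = np.vstack([kv_cache["K"], x @ W_k])
    kv_cache["V"] = np.vstack([kv_cache["V"], x @ W_v])

def attend(q):
    """Softmax attention of query q over all cached keys/values."""
    scores = kv_cache["K"] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ kv_cache["V"]

for _ in range(5):                      # "generate" 5 tokens
    append_token(rng.standard_normal(d))
out = attend(rng.standard_normal(d))
print(kv_cache["K"].shape)  # the cache grows by one row per token
```

The speed win is exactly this: `append_token` runs once per token, while without a cache every step would redo the projection for the entire history. The memory cost is that the cache rows never go away.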
That's the reason your browser, or even your whole computer, starts slowing down when the chat gets too long or the data piles up. I experienced it myself on ChatGPT in 2024/2025 while consulting it about my college coursework (about coding); I asked questions and pasted code so much that just opening the chat made the computer slow. At the time, I worked around it by opening a new chat. Of course, all the context got reset and I had to explain everything from the beginning again.
So, what is the Google Research team's solution? The answer is TurboQuant. It is claimed to compress the KV Cache around 4-5x, so the model can remember longer context and more complicated, more numerous embedded files, while the GPU or RAM saves more memory.
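To get a feel for why a 4-5x compression matters, here is a back-of-the-envelope estimate. The model dimensions, context length, and the 4x ratio below are illustrative assumptions on my part, not figures from the paper:

```python
# Rough KV-cache size for a hypothetical 7B-class model
# (illustrative numbers, NOT taken from the TurboQuant paper).
layers, kv_heads, head_dim = 32, 32, 128
seq_len = 32_000            # tokens of context kept in the cache
bytes_per_value = 2         # fp16

# Factor of 2 because both keys AND values are cached.
cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
print(f"fp16 cache: {cache_bytes / 2**30:.1f} GiB")          # ~15.6 GiB
print(f"~4x compressed: {cache_bytes / 4 / 2**30:.1f} GiB")  # ~3.9 GiB
```

With numbers like these, the uncompressed cache alone can outgrow a consumer GPU's VRAM, which is exactly the slowdown-and-crash behavior described above.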
How Does TurboQuant Work?
TurboQuant works through the step-by-step approach below.
Random Rotation
First, they randomly rotate the input vectors using the Johnson-Lindenstrauss transform concept. This makes the data uniformly distributed, inducing a Beta distribution on each coordinate. The rotation also makes the coordinates almost independent and nearly uncorrelated. To put it simply, imagine we have a messy tangle of ropes. With this "random rotation" technique, we can separate the ropes into uniform lengths with no tangles left.
Optimal Scalar Quantization
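A minimal sketch of the rotation idea, using a dense random orthogonal matrix as a stand-in for the fast structured JL transforms used in practice (which apply in roughly O(d log d) time):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A vector with a few large outlier coordinates -- the kind of
# "tangled rope" that makes per-coordinate quantization hard.
x = rng.standard_normal(d)
x[:4] *= 50.0

# Random orthogonal rotation: QR-decompose a Gaussian matrix.
# (Toy stand-in for a structured Johnson-Lindenstrauss transform.)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = Q @ x

# The rotation preserves the norm (so inner products survive)...
assert np.isclose(np.linalg.norm(x), np.linalg.norm(y))
# ...but spreads the outliers' energy evenly across all coordinates.
print(np.abs(x).max() / np.abs(x).mean())  # large ratio before
print(np.abs(y).max() / np.abs(y).mean())  # much smaller after
```

After the rotation, no single coordinate dominates, so one shared quantization grid fits every coordinate about equally well.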
The uncorrelated, uniform data, or the "untangled ropes", is far easier to quantize (mapping high-precision decimal numbers to much smaller discrete symbols or numbers) with a standard scheme. Back to the rope example: it's much easier to cut a rope into shorter pieces once it's no longer tangled, right? That's what happens here.

But here a problem arises. Standard compression methods make the data slightly inaccurate, causing a "bias" when the AI calculates its attention scores (inner products). Attention scores are a vital point for an LLM model; they are what let the user have in-depth discussions with the model without constantly re-explaining the context.
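A minimal sketch of plain uniform scalar quantization and the small per-score error it leaves in the inner products that attention relies on. This is generic round-to-nearest quantization for illustration, not TurboQuant's actual quantizer:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(v, bits=4):
    """Uniform scalar quantizer: snap each coordinate to one of
    2**bits evenly spaced levels, then map back to floats."""
    levels = 2**bits - 1
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / levels
    codes = np.round((v - lo) / scale)   # small integer codes (what is stored)
    return codes * scale + lo            # dequantized approximation

keys = rng.standard_normal((1000, 64))   # toy stand-in for cached keys
q = rng.standard_normal(64)              # toy query vector

exact = keys @ q
approx = quantize(keys.ravel()).reshape(keys.shape) @ q

# The scores are close, but each one carries a small error -- and a
# systematic error here shifts the attention softmax over long contexts.
print(np.abs(exact - approx).mean())
```

The point of the third step below is precisely to keep this kind of error from becoming a systematic bias in the inner products.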
So, that's where the third step comes in.
High-Precision Bias Correction (Using the QJL Approach)
The Quantized Johnson-Lindenstrauss (QJL) algorithm was recently proposed for this data-compression problem. The algorithm is optimal for quantizing at a bit-width of one. It keeps the data produced by the second step unbiased, letting it approach the Shannon Lower Bound prediction more precisely as the data's bit count increases. It helps the model deliver an unbiased result even from heavily compressed data. Here's the visualization.

| Bit total in data | 1 bit | 2 bit | 3 bit | 4 bit | 8 bit |
[Visualization row: images not preserved; each cell showed the quantized Gaussian data at that bit-width.]
The grey wave visualizes the output target predicted by the Shannon Lower Bound, and the blue wave is the compressed Gaussian data quantized using QJL. As we can see, the higher the bit count, the smoother the blue wave looks, meaning it gets closer and closer to the predicted output target.
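The 1-bit core of the QJL idea can be sketched like this: store only the signs of random projections of each key (plus one scalar, its norm), and still recover an approximately unbiased inner-product estimate. This is my toy reconstruction of the QJL paper's sign estimator, not Google's code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 4096          # m projections; larger m means lower variance

k = rng.standard_normal(d)      # a key vector to be cached
q = rng.standard_normal(d)      # a later query

S = rng.standard_normal((m, d))   # shared random Gaussian projection
cached = np.sign(S @ k)           # only ONE BIT per projection is stored
k_norm = np.linalg.norm(k)        # ...plus one scalar: the key's norm

# Unbiased inner-product estimator (QJL-style):
#   E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||
estimate = np.sqrt(np.pi / 2) * k_norm * np.mean(cached * (S @ q))
print(estimate, q @ k)   # the two values should be close
```

The key property is that the estimator's expectation equals the true inner product, so the error averages out instead of piling up as a bias in the attention scores.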
Why Is TurboQuant a Game Changer?
Before TurboQuant, there was already research on compressing (quantizing) the KV Cache, but LLM models are really sensitive to changes in their data vectors' direction. Improper compression makes the model lose its focus (attention scores) and suddenly become "stupid" again, or start hallucinating. TurboQuant solves this problem with the steps explained above: it can guarantee that even when the input data is compressed extremely, the model still produces unbiased results that stay close to the objective.
TLDR Version
In summary, TurboQuant is really effective at compressing the input data without making the model lose its attention scores or start hallucinating. So the model can remember a longer chat history and more complex, more numerous data.
The Effect
Even though this paper still needs to be proven in real-world testing, it significantly affected the AI bubble that follows the AI research trend. Several hours after the paper was published and Google announced it, the stocks of several major tech companies heavily invested in AI research hardware, such as Micron Technology, Nvidia, and Samsung, reportedly started tumbling. While that's not severe enough to bankrupt any of them, it of course affects them significantly and helps the market price correct. More importantly, GPU prices are reportedly starting to correct slowly. That is good news for us regular gaming enthusiasts.
Closing
Hopefully the algorithm really can make AI consume less memory as claimed, so this GPU price correction keeps making good progress for public demand. Also, hopefully AI researchers keep inventing more and more helpful algorithms so we can enjoy our hobby (playing games) without having to fight the big companies for hardware first.
Further links: Google Research Blog | ArXiv Paper