The AI wave is driving memory demand and sending prices soaring. Yet the market still worries that HBM will repeat the "cyclical" busts of DRAM in the past, reversing sharply once demand peaks. Semiconductor architecture analyst fin argues that HBM's demand logic has already broken away from the traditional rules of the memory industry and is now being re-priced in terms of tokens.
Memory in the CPU era: an optional accessory
fin notes that in the era when CPUs led computing, DDR memory played only a supporting role. CPU engineers developed a whole set of architectural techniques to mask memory latency, including superscalar design, multi-level caches, and register renaming, allowing processors to sustain high performance without depending on fast memory:
An industry rule of thumb holds that even if DDR bandwidth were doubled outright, overall CPU performance would typically improve by no more than 20%.
This architecture directly shaped the growth cadence of the DRAM industry over the past decades. Moving from DDR3 to DDR5 took a full 15 years; over the past 10 years, typical PC memory capacity rose from 7-8 GB to about 23 GB, roughly a tripling in a decade. DRAM makers' profits come mainly from capacity, while bandwidth upgrades are merely a way to raise the unit selling price.
In the CPU era, memory was the segment with the lowest marginal utility in the chip industry, so cyclical booms and busts were both normal and inevitable.
The arrival of the AI inference era fundamentally rewrites the value standard of memory
However, when the protagonist of computing switches to AI inference engines, the yardstick changes as well. Chipmakers no longer compete on how many floating-point operations per second they can deliver; in the AI era there is only one core KPI: how many tokens can be generated per unit of cost and per unit of power.
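As a rough illustration of this KPI, the sketch below converts a token generation rate into tokens per dollar and tokens per joule. All of the figures are hypothetical placeholders, not vendor data.

```python
# Minimal sketch of the AI-era KPI: tokens per unit cost and per unit of energy.
# The throughput, price, and power figures below are hypothetical placeholders.

def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated per dollar of amortized hardware and power cost."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / dollars_per_hour

def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Tokens generated per joule of energy consumed (1 W = 1 J/s)."""
    return tokens_per_second / watts

if __name__ == "__main__":
    # Hypothetical accelerator: 10,000 tokens/s, $4/hour all-in, 700 W board power.
    print(f"{tokens_per_dollar(10_000, 4.0):,.0f} tokens per dollar")
    print(f"{tokens_per_joule(10_000, 700):.1f} tokens per joule")
```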
NVIDIA CEO Jensen Huang’s “AI factory” concept precisely captures this new logic: the purpose of an AI factory is to produce the most tokens at the lowest cost, while pushing the token output rate to the limit. The optimization goal is no longer one-dimensional: it must maximize total token throughput while also pushing up the token output speed of each individual request.
This KPI shift is the starting point for HBM’s fate reversal.
Token throughput formula reveals the first principles behind HBM demand
fin breaks AI inference token throughput down into the product of two parameters: the number of requests processed concurrently (the batch size) × the average token generation speed per request. Tracing the bottleneck of each parameter leads to the same component.
The batch-size bottleneck lies in HBM capacity. Every inference request carries its own KV cache, the structure that stores the model's intermediate states during inference, and this cache must reside in HBM so the model can read it at high speed every time it generates a token. The larger the batch, the more HBM capacity is consumed; the two scale roughly linearly.
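A back-of-envelope sketch of this capacity constraint, using the standard transformer KV-cache estimate and purely illustrative model and GPU figures (not measurements), looks like this:

```python
# Rough sketch of why batch size is capped by HBM capacity. The KV-cache estimate
# (2 tensors x layers x KV heads x head_dim x bytes per element, per token) is the
# standard transformer back-of-envelope; model and GPU figures are illustrative.

def kv_cache_bytes_per_request(num_layers: int, num_kv_heads: int, head_dim: int,
                               context_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one request holds at a given context length (fp16/bf16 by default)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V tensors
    return per_token * context_tokens

def max_batch_size(hbm_bytes: int, weight_bytes: int, kv_bytes_per_request: int) -> int:
    """Concurrent requests that fit once the model weights are resident in HBM."""
    return (hbm_bytes - weight_bytes) // kv_bytes_per_request

if __name__ == "__main__":
    GiB = 1024 ** 3
    kv = kv_cache_bytes_per_request(num_layers=80, num_kv_heads=8, head_dim=128,
                                    context_tokens=8192)
    print(f"KV cache per request: {kv / GiB:.2f} GiB")
    # Hypothetical GPU: 141 GiB of HBM, 70 GiB of weights already loaded.
    print(f"Max concurrent requests: {max_batch_size(141 * GiB, 70 * GiB, kv)}")
```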
The token-speed bottleneck, meanwhile, lies in HBM bandwidth. In the decode stage, every generated token requires re-reading the massive model weights and the KV cache. The read speed directly determines token generation efficiency, and the upper limit of read speed is HBM bandwidth.
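A similar back-of-envelope sketch of the bandwidth bound, again with illustrative figures rather than measurements, shows how the bytes read per decode step cap token speed:

```python
# Back-of-envelope sketch of the bandwidth bound on decode speed: if every decode
# step streams the weights plus the KV cache out of HBM, the step rate (and hence
# per-request tokens/s) cannot exceed bandwidth divided by bytes read per step.
# All figures are illustrative assumptions, not measurements.

def decode_tokens_per_second(hbm_bandwidth_gb_s: float,
                             weight_gb: float,
                             kv_cache_gb: float) -> float:
    """Upper bound on decode step rate when every step re-reads weights + KV cache."""
    gb_read_per_step = weight_gb + kv_cache_gb
    return hbm_bandwidth_gb_s / gb_read_per_step

if __name__ == "__main__":
    # Hypothetical GPU: 8,000 GB/s of HBM bandwidth, 70 GB of weights, 2.5 GB of KV cache.
    print(f"{decode_tokens_per_second(8000, 70, 2.5):.0f} tokens/s upper bound per request")
```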
He says this relationship can be likened to airport shuttle buses: HBM capacity is the size of the bus cabin, which determines how many passengers can be carried at once; HBM bandwidth is the width of the doors, which determines how fast passengers can board and exit; total passenger throughput is the product of cabin size and boarding speed. From this follows the first principle of AI inference hardware demand:
Token throughput ∝ HBM capacity × HBM bandwidth
If GPU token throughput is to keep doubling generation over generation, the product of HBM capacity and HBM bandwidth must also double each generation.
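A minimal sketch of this scaling requirement, treating the capacity-bandwidth product as a throughput proxy; the generation names and specifications are hypothetical placeholders, not actual roadmap figures:

```python
# Sketch of the scaling claim above: with capacity x bandwidth as a throughput
# proxy, a 2x generational throughput target requires the product to double each
# generation. Names and numbers are hypothetical placeholders.

def throughput_proxy(hbm_capacity_gb: float, hbm_bandwidth_tb_s: float) -> float:
    """Relative token-throughput proxy: HBM capacity x HBM bandwidth (arbitrary units)."""
    return hbm_capacity_gb * hbm_bandwidth_tb_s

generations = {
    # name: (HBM capacity in GB, HBM bandwidth in TB/s) -- illustrative values only
    "gen_n":   (96, 4.0),
    "gen_n+1": (144, 5.3),   # capacity x1.5, bandwidth x1.33 -> product roughly 2x
    "gen_n+2": (288, 8.0),   # capacity x2.0, bandwidth x1.5  -> product roughly 3x
}

prev = None
for name, (cap, bw) in generations.items():
    proxy = throughput_proxy(cap, bw)
    ratio = f"{proxy / prev:.2f}x vs previous" if prev else "baseline"
    print(f"{name}: proxy={proxy:,.0f} ({ratio})")
    prev = proxy
```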
Software optimization can’t solve it—HBM demand is locked onto an exponential track
Given this reasoning, the most common market objection is: can't software optimization reduce dependence on HBM? His answer is that software efficiency and hardware specifications are two completely independent dimensions that do not substitute for each other. It is like CPU software: no matter how thoroughly it is optimized, Intel and AMD still have to deliver higher benchmark scores in standard tests each generation, or their products will not sell.
GPU logic is exactly the same: as long as global demand for tokens keeps expanding, the pursuit of higher token throughput will not stop, and the push for progress in both HBM capacity and HBM bandwidth will not stop either.
More importantly, this pressure does not come from external business-cycle tailwinds; it is demand generated endogenously within the supply chain. As long as NVIDIA is selling the next generation of GPUs, it will inevitably press SK hynix, Samsung, and Micron to advance each generation of HBM in both capacity and bandwidth in lockstep, because the ceiling of HBM is the ceiling of GPU performance.
If you plot the token throughput of NVIDIA GPUs from A100 to Rubin Ultra across generations against the corresponding values of “HBM capacity × HBM bandwidth” on the same log-log coordinate chart, the degree of alignment between the two curves will be striking. This is not a historical coincidence; it is an inevitable outcome of system optimization.
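A plotting sketch of that comparison might look like the following. The GPU names, spec values, and throughput numbers are placeholders to be replaced with published specifications and measured token throughput, not real data.

```python
# Sketch of the log-log comparison described above: measured token throughput per
# generation plotted against the HBM capacity x bandwidth product. Entries below
# are placeholders, not real data; fill in published specs and measured throughput.
import matplotlib.pyplot as plt

gpus = {
    # name: (HBM capacity in GB, HBM bandwidth in TB/s, measured tokens/s) -- placeholders
    "gpu_a": (80, 2.0, 1_000),
    "gpu_b": (141, 4.8, 3_300),
    "gpu_c": (288, 8.0, 11_000),
}

products = [cap * bw for cap, bw, _ in gpus.values()]
throughputs = [tps for _, _, tps in gpus.values()]

fig, ax = plt.subplots()
ax.plot(products, throughputs, marker="o")
for name, x, y in zip(gpus, products, throughputs):
    ax.annotate(name, (x, y))
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("HBM capacity x bandwidth (GB x TB/s)")
ax.set_ylabel("Token throughput (tokens/s)")
ax.set_title("Throughput vs. HBM capacity x bandwidth (log-log)")
plt.show()
```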
HBM says goodbye to cyclical fate, but market pricing logic still needs re-evaluation
Based on the above architectural deductions, the fundamental difference between HBM and traditional DRAM is clear. Traditional memory is an accessory to the chip industry with weak demand drivers; once capacity expansion outpaces demand recovery, cyclical price collapses arrive on schedule.
But HBM demand has been locked onto an exponential growth track by the physical logic of AI inference architecture. This has no direct causal link to overall AI market sentiment or to macroeconomic business-cycle swings.
Of course, the real question is not on the demand side but on the supply side: facing strong demand, can SK hynix, Samsung, and Micron restrain the impulse toward blind capacity expansion that has recurred over the past decades, and avoid once again sowing the seeds of an oversupply cycle? The answer will be the key variable in whether this memory upcycle can be sustained over the long term.