Preferred Networks

MN-Core L1000

A Chip for the Inference-Time Scaling Era

TL;DR

MN-Core L1000: Delivering ultra-fast long-context inference for 70B+ LLMs on a single server—coming in 2026.

System size vs LLM inference performance (Llama 3.1 70B)
10x performance improvement within the same system size. No massive rack system needed. You don't even need a server room. MN-Core L1000 will bring ultra-fast inference to your server, even to your workstation. Simply plug an L1000 card into your PCIe slot and enjoy ultra-fast LLM inference.
10x performance improvement for longer inference. Our chip addresses memory bandwidth, the main bottleneck in LLM inference, and excels at handling the longer contexts that are crucial in the inference-time scaling era.

The New Scaling Law: Think Deeper by Thinking Longer!

In recent years, LLM performance has grown dramatically, driven by an empirical principle known as the "scaling law." This principle suggests that increasing computational power, dataset size, and model size during training leads to improved model intelligence. Many organizations have expanded their LLM training based on this principle. However, with most high-quality datasets already utilized and data centers approaching their capacity limits, the sustainability of this traditional training-time scaling law is increasingly uncertain.
Training Time Scaling Law
Training-time Scaling Law [Kaplan+ 20]: Larger datasets, more computation, and more parameters can enhance model intelligence. But is this approach sustainable?
In September 2024, OpenAI introduced its latest LLM, "OpenAI o1." This release marked a new phase in scaling—shifting focus from training larger models to managing longer contexts during inference. The OpenAI o1 model generates hidden tokens, visible only to itself, enabling it to think, reflect, and refine its internal reasoning before producing the final output—similar to how humans take notes while thinking. This approach has yielded significant improvements, particularly in STEM-related tasks. While other inference-time techniques exist, such as selecting the best output from several candidates generated in parallel, OpenAI appears to have demonstrated the power of longer inference to improve output quality.
Inference time scaling
New Inference-time Scaling Law: Model intelligence improves through longer inference, not larger training.
These advancements, however, come with significantly higher inference costs. OpenAI recommends allocating 25,000 tokens specifically for the o1-preview model's internal processing, and these hidden tokens are counted (and billed) as output tokens. Combined with a higher price per token, this makes overall costs grow faster than linearly with context length. To make long inference practical, it is essential to address these escalating costs. Let's explore the cost of LLM inference in the era of inference-time scaling.
API Cost of OpenAI-o1
API cost is on average 33.6x higher!

Memory Bandwidth is All You Need

The cost per token can be broken down into a simple equation:
Cost per Token equation
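The original figure is not reproduced here; written out, the relationship it expresses, in a form consistent with the discussion that follows, is roughly:

```latex
\[
  \text{Cost per token}
    = \frac{\text{system cost per unit time}}
           {\text{tokens per second per user} \times \text{number of concurrent users}}
\]
```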
Therefore, we can reduce costs by either increasing the inference speed per user or increasing the number of concurrent users.
Inference speed per user is determined primarily by memory bandwidth. This is because model sizes are so large that the time needed to generate each token is dominated by the time required to load the model (and KV cache, which we'll discuss later) from memory into the accelerator.
Current GPU systems cannot increase tokens per second per user because of their limited memory bandwidth (which we will discuss later). Instead, they reduce costs by increasing the number of concurrent users through a technique called "batching." Technologies like "continuous batching" enable efficient batch inference.
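To see why batching lowers cost without making any individual user faster, consider a minimal back-of-envelope sketch of bandwidth-bound decoding. The precision and bandwidth figures below are illustrative assumptions (FP16 weights and an H100-class 3.35 TB/s of HBM bandwidth), and the KV cache introduced in the next paragraph is ignored for simplicity; this is not a model of any specific system.

```python
# Back-of-envelope, memory-bandwidth-bound model of decode throughput.
# Assumptions for illustration only: FP16 weights, 3.35 TB/s of HBM bandwidth,
# and no KV cache traffic.

WEIGHT_BYTES = 70e9 * 2      # Llama 3.1 70B in FP16, ~140 GB
BANDWIDTH    = 3.35e12       # bytes per second

# Every decode step streams the full weights from memory once, no matter how
# many users share that step through batching.
step_time = WEIGHT_BYTES / BANDWIDTH      # ~0.042 s per generated token
per_user  = 1 / step_time                 # ~24 tokens/s, for every user

for batch in (1, 8, 64):
    print(f"batch={batch:3d}: {per_user:5.1f} tok/s per user, "
          f"{batch * per_user:7.1f} tok/s aggregate")
```

Under this simplification, aggregate throughput grows with the batch size while each individual user still sees roughly the same couple of dozen tokens per second.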
However, in the era of inference-time scaling, "batching" alone isn't sufficient. To understand why, we need a deeper understanding of how LLMs work.
LLMs rely heavily on the key-value (KV) cache, which stores the attention keys and values of previously processed context and output tokens, serving as the model's "memory." For example, each token (typically a sub-word chunk) requires 320KB of accelerator memory as KV cache when running Llama 3.1 70B with FP16 precision.
1 token uses 320KB on accelerator's memory
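As a sanity check, this figure follows from the publicly documented Llama 3.1 70B configuration (80 transformer layers, 8 key-value heads under grouped-query attention, head dimension 128) at 2 bytes per FP16 value; the sketch below simply multiplies these out.

```python
# KV-cache footprint per token for Llama 3.1 70B with FP16 precision.
n_layers   = 80    # transformer layers
n_kv_heads = 8     # key-value heads (grouped-query attention)
head_dim   = 128   # hidden size 8192 / 64 attention heads
fp16_bytes = 2

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes  # keys + values
print(kv_per_token)  # 327,680 bytes, i.e. 320 KB per token
```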
One token's worth of KV cache might seem small, but with inference-time scaling pushing toward longer outputs, it adds up quickly. For example, with a context length of 128K tokens, the KV cache alone requires 41GB of memory! While techniques exist to compress the KV cache at the cost of some model quality, the trend toward inference-time scaling suggests we'll need even longer contexts in the future. This means memory demands will continue to grow.
128K tokens use 41GB KV cache
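The same arithmetic also shows how quickly long contexts crowd out concurrent users. The memory budget below (an eight-accelerator node with 80 GB each) is an assumption chosen for illustration, not a description of any particular product.

```python
# Illustrative only: how context length limits concurrency on a fixed memory budget.
kv_per_token = 320e3       # bytes per token, from the calculation above
weights      = 140e9       # Llama 3.1 70B in FP16
memory_total = 8 * 80e9    # assumed node: 8 accelerators with 80 GB each

for context in (8_000, 32_000, 128_000):
    kv_per_user = context * kv_per_token
    max_users = int((memory_total - weights) // kv_per_user)
    print(f"{context:>7,}-token context -> {kv_per_user / 1e9:5.1f} GB per user, "
          f"at most ~{max_users} concurrent users")
```

With an 8K context this hypothetical node could serve a couple of hundred users, but at 128K it is down to roughly a dozen.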
Consequently, longer inference requires more memory per user, reducing the total number of users an accelerator can support compared to shorter inferences. While existing systems use "batching" to increase concurrent users, the large memory requirements severely limit this approach. This limitation is a key factor in the high per-token cost of long-inference models.
Longer inference requires more memory on accelerator
If we cannot increase the number of users, what options remain? Looking back at the equation, we see that in the inference-time scaling era, the only way to reduce inference costs is to increase tokens per second per user. And since that speed is determined by how quickly the model and KV cache can be read from memory, memory bandwidth has become the critical factor.
Cost per Token equation

Current Memory Solutions: SRAM vs HBM

Most modern computers use DRAM, an affordable, high-capacity memory found in personal computers and smartphones. However, its memory bandwidth was insufficient for LLMs even before the emergence of inference-time scaling.
Current AI inference hardware primarily relies on two high-speed memory solutions: HBM (High Bandwidth Memory) and SRAM (Static Random-Access Memory). Each offers distinct advantages for different use cases.
Comparing memory technologies: DRAM, SRAM, HBM
HBM offers higher speed than conventional DRAM while maintaining high capacity; SRAM provides much higher speeds but with far lower capacity.
HBM (High Bandwidth Memory) achieves high-speed performance by vertically stacking multiple DRAM dies connected through through-silicon vias (TSVs). While this configuration enables faster data transfer compared to traditional DRAM, bandwidth remains limited by the relatively long distance between memory and logic dies. NVIDIA has widely adopted HBM in their GPUs, enabling substantial memory bandwidth improvements.
On the other hand, newcomers like Cerebras and Groq have embraced SRAM (Static Random-Access Memory). SRAM achieves extremely high speeds by embedding memory cells directly within the logic wafer, minimizing data movement between components. However, SRAM faces two key limitations: it has significantly lower density compared to DRAM (and HBM), and it must share precious die space with logic circuits. These constraints severely limit total capacity, forcing manufacturers to use multiple chips to achieve adequate memory capacity. For example, Groq's system requires 568 chips to run Llama 3.1 70B, while Cerebras's wafer-scale engine (CS-3) features 84 small dies across the wafer, providing 44GB of SRAM in total.
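A rough capacity estimate makes clear why the chip counts end up so high. The per-chip SRAM figures below are approximate public numbers, and the estimate ignores activations, KV cache, and any quantization, so real counts differ; it is only meant to show the order of magnitude.

```python
# Rough capacity check: why SRAM-only systems need many chips for a 70B model.
weights_fp16 = 70e9 * 2       # ~140 GB of FP16 weights

sram_per_groq_lpu = 230e6     # ~230 MB of SRAM per Groq chip (approximate)
sram_per_cs3      = 44e9      # ~44 GB of SRAM per Cerebras CS-3 wafer

print(f"Groq chips needed just to hold FP16 weights: ~{weights_fp16 / sram_per_groq_lpu:.0f}")
print(f"CS-3 wafers needed just to hold FP16 weights: ~{weights_fp16 / sram_per_cs3:.1f}")
```

The first estimate lands in the same ballpark as the 568-chip deployment mentioned above; the exact figure depends on weight precision and how the model is laid out across chips.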
SRAM-based systems deliver impressive LLM inference speeds. However, their large physical footprint presents significant challenges for inference systems, where compact form factors are valuable. The total cost of ownership increases substantially, and the minimum investment required to run capable LLMs rises significantly. These architectures also face limitations in edge computing applications.
Neither solution achieves the combination of high capacity and high memory bandwidth required for the inference-time scaling era. A new solution is needed that can effectively balance both requirements.

3D-Stacked DRAM: Combining SRAM-like Speed with HBM-level Capacity

The MN-Core L1000 leverages 3D-Stacked DRAM technology to overcome the traditional tradeoff between memory capacity and bandwidth. Unlike HBM, which only stacks memory dies vertically, 3D-Stacked DRAM further stacks logic and memory wafers. This enables connections between compute units in the logic wafer and memory across the entire 2D plane, simultaneously achieving both SRAM's high connection density and HBM's large capacity.
3D-Stacked DRAM
3D-Stacked DRAM combines the best of both worlds: high memory capacity and high bandwidth.
With this architecture, we aim to deliver ultra-fast inference for a 70B+ LLM—in a workstation form factor. No data center or server room required. By achieving superior performance and memory density at lower cost, we target a 10x performance improvement over current solutions, perfectly positioning our technology for the era of long-context LLM inference.
System size vs LLM inference performance (Llama 3.1 70B)

Why Isn't 3D-Stacked DRAM More Widespread?

You might wonder why 3D-Stacked DRAM hasn't seen wider adoption. As an emerging technology, one of its primary challenges lies in thermal management. AI chips typically generate substantial heat, while DRAM is temperature-sensitive. However, we're confident in overcoming this challenge because our architecture is inherently energy-efficient, as demonstrated by our three-time Green500 victory (the ranking of the most energy-efficient supercomputers in the TOP500 list). This fundamental energy efficiency makes our approach to 3D-Stacked DRAM viable.
MN-Core architecture is energy-efficient
We plan to launch our first 3D-Stacked DRAM-based inference chip, the MN-Core L1000, in 2026.

Looking Beyond the Horizon

While our immediate focus is on our first inference chip, we have ambitious plans for future generations. We believe subsequent generations can deliver inference for a 1T-parameter model on a single node, with a 128K context length, at approximately 500 tokens per second.

Contact Us

Accelerate your AI inference with L1000

