Just appeared on Weebit's web site
https://www.weebit-nano.com/innovative-memory-architectures-for-ai/

Innovative Memory Architectures for AI
Gideon Intrater

One of the biggest trends in the industry today is the shift toward AI computing at the edge. For many years the expectation was that the huge datacenters in the cloud would perform all the AI tasks, and edge devices would only collect the raw data and send it to the cloud, potentially receiving the end directives after the analysis was done.
More recently, however, it has become increasingly evident that this can't work. While the learning (training) task is a strong fit for the cloud, performing inference in the cloud is far less optimal.
With the promise of lower latency, lower power and better security, we are seeing AI inference in a growing number of edge applications, from IoT and smart home devices all the way up to critical applications like automotive, medical, and aerospace and defense.
Since edge devices are often small, battery-powered, and resource-constrained, edge AI computing resources must be low-power, high-performance, and low-cost. This is a challenge, because power-hungry AI workloads rely on storing large amounts of data in memory and accessing it quickly. Some models have millions of parameters (e.g., weights and biases), which must be continually read from memory for processing. This puts fundamental pressure on the power consumption and latency of the computing hardware.
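To put rough numbers on this, here is a back-of-envelope sketch; the parameter count, quantization and inference rate are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope sketch of weight traffic for a small edge model.
# All values below are illustrative assumptions, not figures from the article.
params = 5_000_000          # assumed model size: 5M weights and biases
bytes_per_param = 1         # assumed int8 quantization
inferences_per_sec = 30     # assumed rate, e.g. a 30 fps vision task

bytes_per_inference = params * bytes_per_param
weight_bandwidth = bytes_per_inference * inferences_per_sec   # bytes/s, weights only

print(f"Weights read per inference: {bytes_per_inference / 1e6:.1f} MB")
print(f"Sustained weight read bandwidth: {weight_bandwidth / 1e6:.0f} MB/s")
# ~5 MB per inference and ~150 MB/s of reads before activations are even
# counted -- where those bytes live determines the power and latency budget.
```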
Data movement is a key contributor to power consumption. Within chips, significant power is consumed while accessing the memory arrays in which the data is stored and while transferring the data over the on-chip interconnect. Memory access time and the speed of the interconnect also contribute to latency, which limits the speed of the AI computation. Speed and power both get significantly worse when the data needs to be moved between two separate chips.
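A similarly rough sketch of the energy side; the per-byte figures are order-of-magnitude assumptions for on-chip versus chip-to-chip transfers, not measured or Weebit-provided values:

```python
# Rough energy comparison for moving one inference's worth of weights.
# The per-byte energies are order-of-magnitude assumptions (on-chip array
# access + interconnect vs. driving I/O pads and board traces).
PJ_PER_BYTE_ON_CHIP = 10
PJ_PER_BYTE_OFF_CHIP = 1_000

bytes_moved = 5_000_000     # same assumed 5 MB of weights as above

on_chip_uj = bytes_moved * PJ_PER_BYTE_ON_CHIP / 1e6     # picojoules -> microjoules
off_chip_uj = bytes_moved * PJ_PER_BYTE_OFF_CHIP / 1e6

print(f"On-chip weight traffic:  ~{on_chip_uj:.0f} uJ per inference")
print(f"Off-chip weight traffic: ~{off_chip_uj:.0f} uJ per inference")
# Roughly two orders of magnitude apart under these assumptions -- the cost
# of crossing a chip boundary, before any latency penalty is counted.
```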
To keep edge computing low-power and low-latency, hardware must be designed so that memory sits as close as possible to the computing resources.
The continuous move to smaller process geometries has helped keep power consumption to a minimum and has also reduced latency for AI tasks. But while computing resources continually scale to more advanced nodes, Flash memory hasn't been able to keep pace. Because of this, it isn't possible to integrate Flash and an AI inference engine in a single SoC at 28nm and below for edge AI.
Today it's standard practice to implement a two-die solution, where one die at an advanced process node is used for computing and another die at a more mature process node is used for memory. The two dies are then integrated in a single package or in two separate packages. A two-die solution is detrimental to AI performance because memory resides far away from compute, driving up power consumption, latency and total system cost.
The ideal solution is a single die that hosts both memory and compute, and embedded ReRAM (or RRAM) is the logical NVM to use. Embedding ReRAM into an AI SoC would replace off-chip flash devices, and it can also replace the large on-chip SRAM used to store the AI weights and CPU firmware. Because ReRAM is non-volatile, there is no need to wait at boot time for the AI model to load from external NVM.
Such ReRAM-based chips use less power and have lower latency than two-chip solutions. And because the on-chip path to memory can be made very wide, the memory interface is no longer limited by the number of pins on an external memory device. The result is faster access time, faster inference, and the potential for true real-time AI computing at the edge.
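A hedged sketch of the interface-width argument; the bus widths, clock rates and model size are assumptions chosen only to show the scale of the difference:

```python
# Sketch of the bandwidth argument: an external memory is limited by its
# pins, while an on-chip memory can expose an arbitrarily wide bus.
# Bus widths, clock rates and the 5 MB model size are assumptions.
def peak_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate in MB/s for a given data-bus width and clock."""
    return width_bits * clock_mhz * 1e6 / 8 / 1e6

external_qspi = peak_mb_per_s(width_bits=4, clock_mhz=100)    # quad-SPI flash: 4 data pins
on_chip_bus   = peak_mb_per_s(width_bits=256, clock_mhz=200)  # wide on-die bus to embedded NVM

model_mb = 5.0
print(f"External quad-SPI flash: ~{external_qspi:.0f} MB/s peak, "
      f"{model_mb / external_qspi * 1e3:.0f} ms to copy the model at boot")
print(f"On-chip 256-bit bus:     ~{on_chip_bus:.0f} MB/s peak, "
      f"{model_mb / on_chip_bus * 1e3:.1f} ms to stream the same data")
# With embedded non-volatile ReRAM the boot-time copy disappears entirely;
# the comparison above only shows how pin-limited the external path is.
```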
ReRAM is also much denser than SRAM, which makes it less expensive per bit, so more memory can be integrated on-chip to support larger neural networks at the same die size and cost. While on-chip SRAM will still be used for data storage, the array will be smaller and the total solution more cost-effective.
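For a feel of the density argument, here is a rough sketch with assumed effective bit-cell areas; the numbers are for illustration only, not process or Weebit datasheet values:

```python
# Illustrative density comparison. The effective bit-cell areas below are
# rough assumptions chosen only to show the shape of the trade-off.
SRAM_UM2_PER_BIT = 0.15    # assumed effective area per SRAM bit, incl. overhead
RERAM_UM2_PER_BIT = 0.03   # assumed effective area per 1T1R ReRAM bit

macro_area_mm2 = 2.0       # assumed on-chip memory area budget
area_um2 = macro_area_mm2 * 1e6

sram_mbytes = area_um2 / SRAM_UM2_PER_BIT / 8 / 1e6
reram_mbytes = area_um2 / RERAM_UM2_PER_BIT / 8 / 1e6

print(f"{macro_area_mm2} mm2 of die area holds ~{sram_mbytes:.1f} MB of SRAM")
print(f"{macro_area_mm2} mm2 of die area holds ~{reram_mbytes:.1f} MB of ReRAM")
# Under these assumptions the denser cell stores ~5x more weights in the same
# silicon, which is where the cost-per-bit advantage comes from.
```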
Finally, ReRAM-enabled chips are cheaper to manufacture, since they require the fabrication of only one die. This makes edge computing more affordable and consequently more accessible for a wide range of applications.
[Figure: how the various memory technologies compare]

You can see here the slides on this topic I presented at the Design Automation Conference (DAC) 2024.