Photo by Sofia Vila Flor on Unsplash

AI heralds the return of monolithic architectures

We know the mainframe as a relic of the past, frequently tended to by long-retired pensioners: the personification of obsolete technology, forever confined to the museum of rarities. Yet aside from the fact that the mainframe is still the most efficient hardware for what it was designed to do, ultra-reliable, high-volume transaction processing, the main thinking behind mainframe architecture is making a surprising comeback to solve the biggest problems of AI.

Mainframes have for decades been ridiculed for their ultra-tight coupling of compute, memory and I/O into monolithic systems. As an architectural style, decoupling has been the main thrust in hardware as well as software: modular, flexible architectures have prevailed across the board. But it turns out this comes at a price in performance, latency, throughput and energy consumption. That price is trivial when you have to store a purchase order in a database or display all the messages of a social network user. For AI inference, not so much.

One of the main issues in AI processing derives from exactly this modularisation, which has permeated modern computing: the memory bottleneck. To explain, let me show you a diagram from my book Cloud Computing Basics to give a general sense (note that GPUs do not work exactly like CPUs because they are parallelised, but the problem is more or less the same):

Data goes into the central memory and is processed by the CPU (or GPU), which needs to dynamically store or retrieve data before it produces an output. When the memory is external, on another piece of hardware, the data has to move through wires. At the massive scale that AI inference requires, all this off-chip activity is taxing in terms of time and energy. The mainframe tightly coupled all these components; modern computing decoupled them. Now coupling is back to save the day for AI. This is done in different ways.
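A quick back-of-the-envelope calculation shows why this off-chip traffic dominates. The sketch below compares the time a kernel spends computing against the time it spends moving data over the memory bus; the bandwidth and FLOP figures are illustrative assumptions, not any particular chip's datasheet numbers:

```python
# Illustrative "which resource limits us?" estimate.
# The peak numbers below are assumptions for the sake of the example.
PEAK_FLOPS = 100e12        # 100 TFLOP/s of compute (assumed)
MEM_BANDWIDTH = 2e12       # 2 TB/s to off-chip memory (assumed)

def bound(flops, bytes_moved):
    """Return whichever resource takes longer: compute or memory traffic."""
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / MEM_BANDWIDTH
    return "memory-bound" if memory_time > compute_time else "compute-bound"

# A large matrix-vector product, typical of LLM inference: roughly
# 2*N*N floating-point operations, but the whole N-by-N weight matrix
# (2 bytes per value in half precision) must stream in from memory.
N = 8192
print(bound(2 * N * N, 2 * N * N))  # -> memory-bound
```

With these assumed numbers the wires are roughly fifty times slower than the arithmetic, which is exactly the imbalance the architectures below try to remove.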

Tighter integration between compute and memory: Nvidia’s newest GPUs (such as the B100 and B200) place high-bandwidth memory (HBM) closer to the processor by stacking memory in three dimensions for high-performance computing. Apple similarly accelerates processing by moving memory and compute closer together in the Neural Engine accelerators on its iPhone chips, which do the heavy lifting for Siri and Face ID.

Better balance of compute and I/O: Another solution to the same problem, one that has nothing to do with mainframe design as such, is the systolic array, which balances compute and I/O by letting data flow through a grid of processing elements without being persisted in memory between steps. It thus stands in opposition to the classical von Neumann architecture of most modern computing, and it maximises the computation done on each piece of data fetched.
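The idea can be sketched in a few lines. The toy simulation below models an output-stationary systolic array computing a matrix product: each processing element multiplies the operands streaming past it and accumulates locally, so partial results never round-trip through a central memory. This is a conceptual sketch of the data schedule, not how any real chip is programmed:

```python
# Toy output-stationary systolic array for C = A x B.
# PE (i, j) accumulates its own output element in place; operands
# flow past it on a skewed schedule instead of being re-fetched.
def systolic_matmul(A, B):
    n = len(A)                            # assume square n x n matrices
    acc = [[0] * n for _ in range(n)]     # one local accumulator per PE
    # At time step t, PE (i, j) sees A[i][k] arriving from the left and
    # B[k][j] arriving from the top, where k = t - i - j (the skew).
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # -> [[19, 22], [43, 50]], the ordinary product
```

In hardware, every multiply-accumulate in an inner time step happens simultaneously across the whole grid, which is where the throughput comes from.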

This approach was first championed by Google in its Tensor Processing Units (TPUs), application-specific integrated circuits that help power Search, Translate and Gemini. It also forms the basic principle of the splittable systolic array of MatX, a chip startup that promises to deliver the highest throughput at the lowest latency. That is taking the mainframe philosophy to the extreme: integrating memory and compute so tightly that they flow as one.

Brute force to minimise memory movement: A third approach is simply to integrate cores at ridiculous scale, like Cerebras, which has built the world’s biggest single chip with around 900,000 cores, distributing memory across the chip. This reduces the need to fetch from and store to off-chip memory. Where systolic arrays optimise how data moves, this reduces the need for data to move at all.

These examples show that ideas are not bad just because they are old. Loose coupling has many desirable effects and is often the best option, but that does not make tight coupling bad.

All architectures, whether in hardware, software or the physical world, have to be assessed against the job they were meant to perform. Sometimes the need for something similar arises in the most surprising contexts. It is therefore important to keep an open mind about architecture: assess the real strengths and weaknesses and understand why something was built as it was, rather than classifying some architectures as good and some as bad. All have trade-offs. It also shows us that there is no “best” architecture. The best architecture will always depend on the situation at hand and the options available.


