
The end of the line for single-chip processors?


Apple has once again surprised enthusiasts and analysts with the release of the M1 Ultra. The chip is a variant of the M1 Max that effectively fuses two of those chips into one, allowing the dual-chip design to be treated by software as a single piece of silicon.

Nvidia made a similar announcement at its 2022 GPU Technology Conference, where CEO Jensen Huang revealed that two of the company's new Grace CPUs would be fused into a single "superchip."

These announcements are aimed at different markets.

Apple is targeting consumer and professional workstations, while Nvidia intends to compete in high-performance computing. But the divergence in purpose only underscores how broadly, and how quickly, the era of single-chip design is ending.

Multi-chip design is nothing new, but the idea has rapidly gained popularity over the past five years. AMD, Apple, Intel, and Nvidia have all embraced it to varying degrees. AMD pursues chiplet designs in its EPYC and Ryzen processors. Intel plans to follow with Sapphire Rapids, an upcoming server architecture built from chiplets it calls "tiles." Now Apple and Nvidia are joining the bandwagon, although their designs are aimed at very different markets.


Nvidia’s Grace CPU Superchip

The challenges of modern chip manufacturing are driving a shift to multi-chip designs. The miniaturization of transistors has slowed, but the growth of transistor counts in leading-edge designs shows no sign of slowing.

Apple’s M1 Ultra packs 114 billion transistors into a die area of roughly 864 square millimeters (Apple hasn't published official figures for the M1 Ultra, but a single M1 Max die measures 432 mm²).
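That estimate is easy to sanity-check: since the Ultra is two Max dies fused together, a back-of-the-envelope doubling (not an Apple-published figure) gives

$$A_{\text{M1 Ultra}} \approx 2 \times A_{\text{M1 Max}} = 2 \times 432\ \text{mm}^2 = 864\ \text{mm}^2$$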

The transistor count for Nvidia’s Grace CPU is still under wraps, but the Hopper H100 GPU announced alongside it contains 80 billion transistors. To put that in perspective, AMD’s 64-core EPYC “Rome” processor, released in 2019, contains 39.5 billion transistors.

“Soaring transistor counts are pushing modern chip manufacturing to its limits, making multi-chip designs ever more attractive,” said Akshara Bassi, research analyst at Counterpoint. “Multi-chip module packaging allows chipmakers to deliver better power efficiency and performance than single-die designs as die sizes grow larger and wafer yield issues become more prominent.”

Judging by the current state of the market, with the exception of Cerebras (a startup trying to build chips that span an entire silicon wafer), the chip industry seems to agree that monolithic designs are becoming more trouble than they’re worth.

The shift to chiplets is happening in tandem with support from manufacturers. TSMC leads the way with a suite of advanced packaging technologies called 3DFabric. AMD uses 3DFabric technology in some EPYC and Ryzen processor designs, and Apple almost certainly uses it for the M1 Ultra (Apple has not confirmed this, but TSMC manufactures the chip). Intel has its own packaging technologies, such as EMIB and Foveros. While initially intended for Intel’s own use, these manufacturing technologies are becoming relevant to a broader range of companies as Intel Foundry Services opens them up.

“The ecosystem around baseline semiconductor design, manufacturing, and packaging has evolved to support the design nodes needed to produce chiplet-based solutions economically and reliably,” Mark Nossokoff, senior analyst at Hyperion Research, said in an email. “Software design tools that seamlessly integrate the various chiplet functions of electronic components have also matured to optimize the performance of the target solution.”

Chiplets are here to stay, but for now it’s a siloed world. AMD, Apple, Intel, and Nvidia each use their own interconnect designs tied to specific packaging technologies.

Universal Chiplet Interconnect Express (UCIe) wants to bring the industry together. Announced on March 2, 2022, the open standard offers a “standard” 2D package for cost-effective performance and an “advanced” package for leading-edge designs. UCIe also supports off-package connectivity via PCIe and CXL, opening up the potential for connecting multiple chips across multiple machines in high-performance computing environments.

Examples of UCIe packaging options, from the UCIe white paper

UCIe is a start, but the future of the standard remains to be seen. “The original UCIe founding members represent a wide range of distinguished contributors to technology design and manufacturing, including the HPC ecosystem,” Nossokoff said, “but many major industry players have yet to join, including Apple, AWS, Broadcom, IBM, and Nvidia, as well as other silicon foundries and memory providers.”

Bassi noted that Nvidia might be particularly reluctant to participate. The company has opened up its NVLink-C2C interconnect for custom silicon integration, making it a potential competitor to UCIe.

The fate of interconnects like UCIe and NVLink-C2C remains to be seen, but whichever wins out, it is unlikely to change the game being played.

Apple’s M1 Ultra can be seen as the canary in the coal mine: multi-chip design is no longer limited to the data center, and it’s showing up on a home computer near you.

Three approaches to the 3D chip

For several years now, system-on-chip developers have been breaking their increasingly large designs into smaller chiplets and linking them together within the same package, effectively increasing silicon area, among other advantages. In CPUs, most of these links are so-called 2.5D integration, in which chiplets are set side by side and connected with short, dense interconnects. Now that most major manufacturers have agreed on a standard for 2.5D chiplet-to-chiplet communication, the momentum behind this kind of integration will only grow.

But to move large amounts of data as though it were all on the same chip, you need even shorter, denser connections, and that can only be achieved by stacking one chip atop another. Connecting two chips face to face can mean thousands of connections per square millimeter.

It takes a lot of innovation to make that work. Engineers must figure out how to keep heat from one chip in the stack from killing the other, decide which functions should go where and how they should be manufactured, prevent the occasional bad chiplet from leading to a lot of expensive dud systems, and deal with the attendant complexity of solving all of these problems at once.

Here are three examples, ranging from the relatively simple to the bewilderingly complex, showing where 3D stacking is now.

AMD’s 3D V-Cache technology attaches a 64-megabyte SRAM cache [red] and two blank structural chiplets to a Zen 3 compute chiplet.


AMD’s Zen 3

PCs have long offered the option of adding more memory, giving extra speed for oversize applications and data-heavy work. Thanks to 3D chip stacking, AMD’s next generation of CPU chiplets offers that option too. It’s not an aftermarket add-on, of course; but if you’re shopping for a computer with extra oomph, ordering a processor with a large stack of cache memory may be the way to go.

Although both the Zen 2 and the new Zen 3 processor cores are built with the same TSMC manufacturing process (and therefore have identically sized transistors, interconnects, and everything else), AMD made so many architectural changes that Zen 3 delivers an average performance increase of 19 percent even without the additional cache. One of the architectural gems is the inclusion of a set of through-silicon vias (TSVs), vertical interconnects that run directly through most of the silicon. The TSVs are built into Zen 3’s top-level cache, an SRAM block called L3, which sits in the middle of the compute chiplet and is shared among all eight of its cores.

In processors intended for data-heavy workloads, the back side of the Zen 3 wafer is thinned until the TSVs are exposed. A 64-megabyte SRAM chiplet is then bonded to those exposed TSVs using hybrid bonding, a process something like cold-welding copper together. The result is dense connections at a pitch as tight as 9 micrometers. Finally, blank silicon dies are attached to cover the rest of the Zen 3 chiplet, providing structural stability and thermal conduction.
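That bond pitch is what makes the "thousands of connections per square millimeter" figure from earlier plausible. As a rough sketch, modeling the bonds as a square grid at a 9-micrometer pitch gives

$$\text{density} \approx \frac{1}{(9\ \mu\text{m})^2} = \frac{1}{(0.009\ \text{mm})^2} \approx 12{,}300\ \text{connections per mm}^2$$

This is an idealized upper bound; real layouts reserve area for power delivery, keep-out zones, and redundancy.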

Adding the extra memory by placing it beside the CPU chiplet isn’t an option, because data would take too long to reach the processor cores. “Despite tripling the L3 [cache] size, 3D V-Cache adds only four [clock] cycles of latency, something that can only be achieved with 3D stacking,” said John Wuu, senior design engineer at AMD.
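AMD doesn't tie that figure to a particular clock, but to make it concrete: at a hypothetical 4 GHz clock (an assumed number, for illustration only), four cycles works out to

$$\frac{4\ \text{cycles}}{4\ \text{GHz}} = 1\ \text{ns}$$

of added latency, versus the tens of nanoseconds a trip to off-package DRAM would cost.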

The enormous cache has a place in high-end gaming. A desktop Ryzen CPU with 3D V-Cache increases game speed at 1080p by an average of 15 percent. It also works for more serious tasks, reducing the run time of complicated semiconductor design calculations by 66 percent.
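That last figure is easy to understate. Expressed as a speedup rather than a reduction:

$$\text{speedup} = \frac{1}{1 - 0.66} \approx 2.9\times$$

so those design calculations finish nearly three times faster.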

Wuu notes that the industry’s ability to shrink SRAM is slowing compared with its ability to shrink logic. As a result, you can expect future SRAM expansions to keep being built with more mature manufacturing processes while compute chiplets are pushed along the leading edge of Moore’s Law.

Graphcore’s Bow AI accelerator uses 3D chip stacking to boost performance by 40 percent.


Graphcore’s Bow AI Processor

3D integration can speed up computation even when one chip in the stack has no transistors at all. UK-based AI computer company Graphcore dramatically increased the performance of its systems simply by attaching a power-delivery die to its AI processors. The combined chip, called Bow, runs faster (1.85 GHz versus 1.35 GHz) and at lower voltage than its predecessor, training neural networks 40 percent faster while consuming 16 percent less energy. Notably, users don’t need to change their software to get the improvement.
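Most of that gain is explained by the clock increase alone. Assuming throughput scales roughly linearly with frequency (an assumption for illustration, not a Graphcore claim):

$$\frac{1.85\ \text{GHz}}{1.35\ \text{GHz}} \approx 1.37$$

a 37 percent faster clock, in line with the roughly 40 percent faster training Graphcore reports.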

The power-delivery die is a combination of capacitors and through-silicon vias. The latter carry power and data to the processor die. It’s the capacitors that make the real difference: like the bit-storing components in DRAM, they are formed in deep, narrow trenches in the silicon. Because these reservoirs of charge sit so close to the processor’s transistors, power delivery is smoothed out, letting the processor cores run faster at lower voltage. Without the power-delivery die, the processor would have to raise its operating voltage above its nominal level to run at 1.85 GHz, consuming more power. With it, the chip reaches that clock rate while consuming less.
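A first-order model shows why putting the capacitance so close helps. If the cores' current draw jumps by ΔI for a time Δt before the external regulator can respond, an on-package capacitance C limits the supply droop to roughly

$$\Delta V \approx \frac{\Delta I \cdot \Delta t}{C}$$

so more capacitance, closer to the transistors, means less voltage guard band is needed. (This is a textbook decoupling sketch, not Graphcore's published analysis.)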

The manufacturing process used to make Bow is unique, but it is unlikely to stay that way. Most 3D stacking is done by bonding individual chiplets to another chip while the latter is still on the wafer, a scheme called chip-on-wafer [see “AMD’s Zen 3” above]. Bow instead uses TSMC’s wafer-on-wafer process, in which an entire wafer of one type is bonded to an entire wafer of another and then diced into chips. It’s the first chip on the market to use the technology, says Graphcore CTO Simon Knowles, and it allows a higher density of connections between the two dies than a chip-on-wafer process can achieve.

Although the power-delivery die has no transistors, they may be coming. Using the technology only for power delivery “is just a first step for us,” Knowles said. “In the near future, it will go much further.”

Intel’s Ponte Vecchio integrates 47 chiplets into a single processor.


Intel’s Ponte Vecchio supercomputer chip

The Aurora supercomputer is designed to be one of the first U.S. high-performance computers to break the exaflop barrier: a quintillion (10^18) high-precision floating-point calculations per second. To get Aurora to those heights, Intel’s Ponte Vecchio packs more than 100 billion transistors across 47 chiplets into a single processor. Using 2.5D and 3D technologies, Intel squeezed 3,100 square millimeters of silicon (nearly equal to four Nvidia A100 GPUs) into a footprint of 2,330 square millimeters.
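For scale, an exaflop is a thousand times the petaflop milestone that supercomputing reached in 2008:

$$1\ \text{exaflop/s} = 10^{18}\ \text{FLOP/s} = 1000 \times 1\ \text{petaflop/s}$$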

Intel’s Wilfred Gomes told engineers attending the IEEE International Solid-State Circuits Conference that the processor pushed Intel’s 2D and 3D chiplet-integration technologies to their limits.

Each Ponte Vecchio is actually two mirror-image chiplet sets bound together using Intel’s 2.5D integration technology, Co-EMIB, which forms a high-density interconnect bridge between the two 3D chiplet stacks. The bridge itself is a small piece of silicon embedded in the package’s organic substrate; interconnects can be made about twice as dense on silicon as on the organic substrate.

Co-EMIB dies also connect high-bandwidth memory and I/O chiplets to the “base tile,” the largest chiplet, on which the rest of the stack sits.

The base tile uses Intel’s 3D stacking technology, Foveros, to stack compute and cache chiplets on top of it. The technology creates a dense array of die-to-die vertical connections between two chips, at a pitch as small as 36 micrometers, through short copper pillars and solder microbumps. Signals and power enter the stack through TSVs, fairly wide vertical interconnects that run straight through most of the silicon.
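Pitch sets connection density quadratically, which puts Foveros' microbumps in context against the hybrid bonding AMD uses: a 36-micrometer pitch yields roughly (36/9)² = 16 times fewer connections per unit area than a 9-micrometer hybrid bond, on the order of

$$\frac{1}{(36\ \mu\text{m})^2} \approx 770\ \text{connections per mm}^2$$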

Eight compute tiles, four cache tiles, and eight blank “thermal” tiles, meant to draw heat out of the processor, are connected to the base tile, which itself provides cache memory and a network that lets any compute tile access any memory.

None of this was easy, says Gomes; it required innovations in yield management, clock circuitry, thermal regulation, and power delivery. For example, Intel engineers chose to supply the processor with a higher-than-normal voltage (1.8 volts) so that the current would be low enough to simplify the package. Circuitry in the base tile steps the voltage down to about 0.7 V for use by the compute tiles, and each compute tile gets its own power domain in the base tile. Key to this capability are new high-efficiency inductors called coaxial magnetic integrated inductors. Because these are built into the package substrate, the circuit actually snakes back and forth between the base tile and the package before supplying voltage to the compute tiles.
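The logic behind the 1.8-volt choice is plain power arithmetic: for a fixed power draw, current scales inversely with supply voltage, I = P/V. Using an assumed 600-watt draw purely for illustration (Intel has not published this figure):

$$\frac{600\ \text{W}}{0.7\ \text{V}} \approx 857\ \text{A} \qquad \text{versus} \qquad \frac{600\ \text{W}}{1.8\ \text{V}} \approx 333\ \text{A}$$

Lower current allows thinner power-delivery paths in the package and cuts resistive losses, which grow as I²R.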

Gomes noted that it took a full 14 years to go from the first petaflop supercomputer in 2008 to this year’s exaflop machines. Advanced packaging, such as 3D stacking, he told engineers, is among the technologies that could help shrink the next thousandfold improvement in computing to just six years.
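Compounded annually, the difference between those two timelines is stark:

$$1000^{1/14} \approx 1.64\times\ \text{per year} \qquad \text{versus} \qquad 1000^{1/6} \approx 3.16\times\ \text{per year}$$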
