Nine of the world’s 10 most powerful supercomputers are powered by GPUs, but that might not be the case for much longer. As chipmakers like Nvidia prioritize AI FLOPS over the ultra-precise floating point calculations used in scientific computing, US National Labs are turning to new chip architectures to get their FP64 fix.

Among the candidates is NextSilicon’s Maverick-2, a dataflow processor designed explicitly for the 64-bit floating point mathematics that dominates the Department of Energy’s most important simulations.

Despite its name, the Department of Energy is concerned with far more than the US power grid. It operates some of the largest publicly known supercomputers in the world, which are responsible for everything from simulating the physics of nuclear weapons at the moment of criticality and bioweapons defense to public health and safety.

Since the Titan supercomputer made its debut in 2012, a growing number of these supercomputers have been powered by GPUs from Nvidia, and more recently AMD. But that’s not the case for Sandia National Laboratories’ new Spectra supercomputer, which was built in collaboration with Penguin Solutions and NextSilicon.

Compared to exascale systems like Frontier or El Capitan, Spectra is tiny. The machine counts 64 nodes and 128 of NextSilicon’s “runtime-configurable” accelerators. But scale isn’t the point. Spectra is a test bed for NextSilicon’s Maverick-2.

This week, Sandia gave the chips the thumbs up, announcing that the big iron had met all of its system acceptance requirements, opening the door for the chips to be deployed in larger systems in the future.

Not another GPU

Despite some similarities to Nvidia’s B200, Maverick-2 is a very different beast. Instead of the standard von Neumann compute architecture that underpins most CPUs and GPUs today, NextSilicon’s chips employ a reconfigurable dataflow architecture.
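
For readers who want a feel for what “dataflow” means in practice, here is a toy Python sketch of the general idea: each node performs one fixed operation and fires the moment both of its operands arrive, pushing the result straight to whatever consumes it. It is purely illustrative. The `Node` class, the `receive` method, and the three-node graph are hypothetical names of our own and say nothing about how Maverick-2 or NextSilicon’s compiler actually work.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One unit configured for a single operation (hypothetical model)."""
    name: str
    op: Callable[[float, float], float]
    consumers: list = field(default_factory=list)  # (node, input_slot) pairs
    inputs: dict = field(default_factory=dict)     # slot -> operand value
    result: Optional[float] = None

    def receive(self, slot: int, value: float) -> None:
        """Accept an operand; fire as soon as both operands are present."""
        self.inputs[slot] = value
        if len(self.inputs) == 2:
            self.result = self.op(self.inputs[0], self.inputs[1])
            # Forward the result directly to downstream nodes,
            # with no intermediate store and reload.
            for node, downstream_slot in self.consumers:
                node.receive(downstream_slot, self.result)


def evaluate(a: float, b: float, c: float, d: float) -> float:
    """Compute (a + b) * (c - d) by streaming values through a small graph."""
    add = Node("add", operator.add)
    sub = Node("sub", operator.sub)
    mul = Node("mul", operator.mul)
    add.consumers.append((mul, 0))
    sub.consumers.append((mul, 1))

    # Inject the inputs; the multiply fires on its own once both
    # upstream results have arrived.
    add.receive(0, a)
    add.receive(1, b)
    sub.receive(0, c)
    sub.receive(1, d)
    return mul.result


if __name__ == "__main__":
    print(evaluate(1.0, 2.0, 5.0, 3.0))
```

Nothing in `evaluate` ever tells the multiply when to run. It is triggered by the arrival of data rather than by a program counter, which is the essential difference from a conventional instruction stream.
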
The processor’s two compute dies comprise a grid of arithmetic logic units interconnected in a graph. Each unit is configured at runtime to perform a specific operation, whether it be addition, multiplication, or some other logic operation.

But the chip’s real trick is overlapping data flow and compute. As soon as data reaches the next unit in the pipeline, it’s computed immediately, with no waiting for load-store operations to shuffle data around. According to NextSilicon, this dramatically improves the performance and efficiency of the chips in real-world workloads.

Dataflow architectures aren’t new. Groq, Cerebras, and SambaNova have all built chips based on the concept. However, all of these designs are aimed at AI inference or training. NextSilicon’s is one of the few we’ve seen aimed at HPC.

Dataflow is notoriously difficult to program for, which is likely why the startups that have built chips around it have largely offered them as a managed or white glove service rather than selling bare metal servers.

Rather than asking users to port workloads to its chips, NextSilicon has built a compiler that it claims can run any existing C, Python, Fortran, or CUDA codebase on them. As we understand it, it works by initially running these workloads on the CPU. The compiler then captures the compute graph, maps it to the chips, and optimizes it to maximize performance.

With Spectra, Sandia has now validated the parts across three key workloads: the high-performance conjugate gradient (HPCG) benchmark, the LAMMPS molecular dynamics test suite, and the Sparta Monte Carlo simulation suite.

AI is changing GPUs

NextSilicon’s focus on HPC stands in stark contrast to the next generation of GPUs from Nvidia. The company’s Rubin GPUs, due out later this year, promise gobs of memory bandwidth and up to 50 petaFLOPS of FP4 compute.
This makes the chips strong contenders for AI inference and training workloads, which is probably why the DoE is also deploying them in systems like the Doudna supercomputer at Lawrence Berkeley National Laboratory. While FP64 compute remains relevant for many existing scientific workloads, the labs run AI workloads too, and for those Nvidia’s GPUs are still relevant.

However, all those AI FLOPS come at the expense of hardware FP64 vector and matrix performance. Rubin tops out at 33 teraFLOPS of FP64, making it slower than even Nvidia’s nearly four-year-old H100.

But that’s not to say it’s not good for scientific computing. For matrix-heavy workloads like High Performance Linpack (HPL), Nvidia is leaning on a somewhat controversial spin on the Ozaki scheme, which uses lower precision data types to emulate FP64 compute. Using this approach, Nvidia claims Rubin can deliver up to 200 teraFLOPS of FP64 matrix performance.

We dug deeper into Nvidia’s emulated FP64 algorithms earlier this year, but suffice it to say it’s not perfect. While it has shown promise in certain HPC workloads, in others, particularly vector-heavy ones like computational fluid dynamics, it offers little if any benefit.

Coincidentally, the latter happens to be the same kind of workload that NextSilicon has focused its attention on.

We don’t yet have system-level benchmarks for NextSilicon’s hardware, much less Spectra, but we’re told a single Maverick-2 can deliver about 600 gigaFLOPS of FP64 compute in HPCG. The startup claims this performance is roughly on par with leading GPUs while consuming half the power.

While Nvidia is clearly prioritizing AI compute in its latest generation of GPUs, AMD has taken a different approach. Like Rubin, AMD’s new MI455X accelerators are tuned for AI inference and training, but the MI455X is only one of several versions of the GPU the House of Zen has baked in TSMC’s oven. For the MI430X, AMD swapped out the AI-centric compute dies for some built specifically for HPC.
Earlier this month, we learned the chip would deliver up to 200 teraFLOPS of peak FP64 grunt to the DoE’s upcoming Discovery and Europe’s Alice Recoque supercomputers.

Who needs GPUs anyway?

Chip startups like NextSilicon still need to prove their chips can scale to larger systems. But, across the Pacific, China has already shown that, at least for scientific computing, it doesn’t need GPUs to compete with the West’s best supers.

China has a history of building boutique silicon specifically to advance its national supercomputing capability. Some systems, like the Sunway TaihuLight supercomputer, used a custom manycore processor with 260 RISC cores. Others, like the Tianhe 2A, used a homegrown digital signal processor (DSP) called the Matrix 2000 for their FP64 compute.

More recently, we caught wind of a new supercomputer, called the LineShine, that, similar to the TaihuLight machine, reportedly uses 47,000 custom CPUs, which are expected to push the machine to 2 exaFLOPS of FP64 grunt.

Of course, because China doesn’t participate in the annual Top500 ranking of the fastest publicly known supers anymore, we may never know for sure.

China’s use of boutique silicon is due in part to US trade restrictions on the sale of high-end accelerators in the region. Even where still legal, these chips have become a supply chain vulnerability for Beijing. In fact, the US government’s decision to bar Intel from selling its Xeon Phi processors to China drove the development of the Matrix 2000.

In the US, the bigger challenge may be competing with chip designers’ shareholders. AI has made Nvidia the most valuable company in the world; HPC, by comparison, remains an important, albeit niche, market. ®