A Configurable Cloud-Scale DNN Processor for Real-Time AI

Jeremy Fowers; Kalin Ovtcharov; Michael Papamichael; Todd Massengill; Ming Liu; Daniel Lo; Shlomi Alkalay; Michael Haselman; Logan Adams; Mahdi Ghandi; Stephen Heil; Prerak Patel; Adam Sapek; Gabriel Weisz; Lisa Woods; Sitaram Lanka; Steven K. Reinhardt; Adrian M. Caulfield; Eric S. Chung; Doug Burger

doi:10.1109/isca.2018.00012

Affiliation (all authors): Microsoft Research (United Kingdom)

June 1, 2018

Cited by 524

Abstract

Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, also known as "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
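For a rough sense of the scale quoted in the abstract, the following Python sketch does some back-of-envelope arithmetic. It assumes that a dense matrix-vector product costs one multiply and one add per matrix element, and that each multiply-accumulate unit completes one multiply-add per cycle. The function names and the 250 MHz clock are illustrative assumptions, not figures taken from the abstract, and the sketch does not model the actual Brainwave ISA or microarchitecture.

```python
import math


def matvec_ops(rows: int, cols: int) -> int:
    """Operations in one dense matrix-vector product (one multiply + one add per element)."""
    return 2 * rows * cols


def min_dim_for_ops(target_ops: int) -> int:
    """Smallest square dimension n such that an n x n matrix-vector product needs >= target_ops."""
    n = math.isqrt(target_ops // 2)
    while 2 * n * n < target_ops:
        n += 1
    return n


def peak_tflops(mac_units: int, clock_hz: float) -> float:
    """Idealized peak throughput, assuming every MAC unit retires one multiply-add per cycle."""
    return 2 * mac_units * clock_hz / 1e12


if __name__ == "__main__":
    # How large a square matrix-vector product is needed to reach ~7M operations?
    n = min_dim_for_ops(7_000_000)
    print(f"{n} x {n} matrix-vector product: {matvec_ops(n, n):,} operations")

    # Hypothetical clock of 250 MHz (an assumption) with the 96,000 MAC units from the abstract.
    print(f"Idealized peak: {peak_tflops(96_000, 250e6):.1f} TFLOPS")
```

Under these assumptions, a single matrix-vector product over a matrix of roughly 1,900 by 1,900 elements already accounts for about seven million operations, which suggests how one coarse-grained instruction could fan out to that much work. The idealized peak is an upper bound only; the abstract reports sustained performance of ten to over thirty-five teraflops.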
