Michaela Blott

FINN-R

Michaela Blott, Thomas B. Preußer, Nicholas J. Fraser et al.|ACM Transactions on Reconfigurable Technology and Systems|2018

Cited by 400|Open Access

Convolutional Neural Networks have rapidly become the most successful machine-learning algorithm, enabling ubiquitous machine vision and intelligent decisions on even embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines leveraging custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets, and a specific precision. We introduce formalizations of resource cost functions and performance predictions and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.

OSNT: open source network tester

Gianni Antichi, Muhammad Shahbaz, Yilong Geng et al.|IEEE Network|2014

Cited by 78

Despite network monitoring and testing being critical for computer networks, current solutions are both extremely expensive and inflexible. Into this lacuna we launch the Open Source Network Tester, a fully open source traffic generator and capture system. Our prototype implementation on the NetFPGA-10G supports 4 × 10 Gb/s traffic generation across all packet sizes, and traffic capture is supported up to 2 × 10 Gb/s with naïve host software. Our system implementation provides methods for scaling and coordinating multiple generator/capture systems, and supports 6.25 ns timestamp resolution with clock drift and phase coordination maintained by GPS input. Additionally, our approach has demonstrated lower cost than comparable commercial systems while achieving comparable levels of precision and accuracy, all within an open-source framework extensible with new features to support new applications, while permitting validation and review of the implementation.

Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware

David Sidler, Gustavo Alonso, Michaela Blott et al.|Unknown|2015

Cited by 77

TCP/IP is the predominant communication protocol in modern networks but also one of the most demanding. Consequently, TCP/IP offload is becoming increasingly popular with standard network interface cards. TCP/IP Offload Engines have also emerged for FPGAs, and are being offered by vendors such as Intilop, Fraunhofer HHI, PLDA and Dini Group. With the target application being high-frequency trading, these implementations focus on low latency and support a limited session count. However, many more applications beyond high-frequency trading can potentially be accelerated inside an FPGA once TCP with high session count is available inside the fabric. This way, a network-attached FPGA on ingress and egress to a CPU can accelerate functions such as encryption, compression, memcached and many others in addition to running the complete network stack. This paper introduces a novel architecture for a 10Gbps line-rate TCP/IP stack for FPGAs that can scale with the number of sessions and thereby addresses these new applications. We prototyped the design on a VC709 development board, demonstrating compatibility with existing network infrastructure, operating at full 10Gbps throughput full-duplex while supporting 10,000 sessions. Finally, the design has been described primarily using high-level synthesis, which accelerates development time and improves maintainability.

Achieving 10Gbps Line-rate Key-value Stores with FPGAs

Michaela Blott, Kimon Karras, Ling Liu et al.|TUbilio (Technical University of Darmstadt)|2013

Cited by 77

Distributed in-memory key-value stores such as memcached have become a critical middleware application within current web infrastructure. However, typical x86-based systems yield limited performance scalability and high power consumption as their architecture with its optimization for single thread performance is not well-matched towards the memory-intensive and parallel nature of this application. In this paper we present the design of a novel memcached architecture implemented on Field Programmable Gate Arrays (FPGAs) which is the first in literature to achieve 10Gbps line rate processing for all packet sizes. By transformation of the functionality into a dataflow architecture, the implementation can not only provide significant speed-up but also operate at a lower power consumption than any x86. More specifically, with our prototype we have measured an increase of up to a factor of 36x in requests per second per Watt that can be serviced in comparison to the best published numbers for regular servers with optimized software. Additionally, we show that through the tight integration of network interface, memory and compute, round trip latency can be reduced down to below 4.5 microseconds.

A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)

John W. Lockwood, Adwait Gupte, Nishit Mehta et al.|Unknown|2012

Cited by 73

Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative "hybrid" architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10Gb/s Ethernet line rate with a fixed end-to-end latency of 1μs, up to two orders of magnitude lower than comparable software implementations.

Top publications by citations