Using the FABulous eFPGA Framework for Implementing AI Systems
Speaker: Dirk Koch - Heidelberg University
Abstract: AI is a key driver in the electronics industry, and technology is progressing so fast that hardware reconfiguration is virtually the only option that allows systems to take advantage of technology improvements in the field. Moreover, reconfiguration allows the same chip to adapt to different problem domains and to the corresponding architectural features of the ML model. While there are many powerful tools available for mapping ML models onto FPGAs, this talk looks into defining the FPGA fabrics themselves. These fabrics can then be integrated as embedded FPGAs into SoCs to provide the performance and energy efficiency of an accelerator with the opportunity to update the fabric in the field. The talk will detail the differences between embedded FPGAs and dedicated general-purpose devices. Based on this, we will introduce the FABulous embedded FPGA framework with respect to defining architectures tailored to ML applications. This includes customization of the primitives (like arithmetic blocks and memories), the routing fabric, the exact fabric layout (in terms of the number of primitives, and the size and shape of the fabric), as well as the fabric interfaces. Furthermore, we will briefly cover aspects related to ASIC integration into an SoC, such as performance and power characterization. As an outlook, we will sketch how the FABulous framework may be integrated into a co-design framework for building domain-specific AI acceleration engines.
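As a loose illustration of the fabric-layout customization described above, the following sketch composes a small grid of tile types and derives resource counts. It is purely illustrative and does not reflect the actual FABulous input format; the tile names, grid dimensions, and column assignments are invented for this example.

```python
# Illustrative sketch (not the actual FABulous fabric description): an eFPGA
# fabric can be viewed as a 2D grid of tile types. Here we compose a small
# ML-oriented fabric with extra DSP and block-RAM columns and count resources.

def build_fabric(rows, cols, dsp_cols, bram_cols):
    """Return a row-major grid of tile-type strings."""
    fabric = []
    for r in range(rows):
        row = []
        for c in range(cols):
            if c in dsp_cols:
                row.append("DSP")    # arithmetic column for MAC-heavy ML kernels
            elif c in bram_cols:
                row.append("RAM")    # on-fabric memory column
            else:
                row.append("LUT")    # general-purpose logic
        fabric.append(row)
    return fabric

def resource_counts(fabric):
    counts = {}
    for row in fabric:
        for tile in row:
            counts[tile] = counts.get(tile, 0) + 1
    return counts

fabric = build_fabric(rows=8, cols=10, dsp_cols={3, 7}, bram_cols={5})
print(resource_counts(fabric))  # {'LUT': 56, 'DSP': 16, 'RAM': 8}
```

Changing the row/column parameters is the software analogue of tailoring the number of primitives and the size and shape of the fabric to a given ML workload.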
Short bio
Dirk Koch is a Reader in the Advanced Processor Technologies Group at the University of Manchester. His main research interests are in run-time reconfigurable systems based on FPGAs, embedded systems, computer architecture, VLSI, and hardware security. Dirk developed techniques and tools for self-adaptive distributed embedded control systems based on FPGAs. Current research projects include database acceleration using FPGA-based stream processing, HPC and exascale computing, as well as reconfigurable instruction set extensions for CPUs using FPGAs. Moreover, his group maintains the FABulous open-source FPGA framework. Dirk Koch is the author of the book "Partial Reconfiguration on FPGAs" and a co-editor of the book "FPGAs for Software Programmers".
Compiling and Operating Dynamic Stream Processing Pipelines on FPGAs
Speaker: Kaspar Matas - University of Manchester
Abstract: FPGAs provide both high throughput and energy efficiency with stream processing pipelines. However, creating efficient pipelines is challenging for problems having data-dependent producer-consumer rates or a variable number of processing stages, as in database acceleration. Current static acceleration approaches cease to be effective in such dynamic scenarios, and systems are usually limited to accelerating fixed operations like compression and filtering. This PhD project proposes an open-source framework for orchestrating the composition of dynamically stitched stream processing accelerator pipelines. This automatic module stitching is transparent to the user, and the pipelines for the FPGA are compiled and managed by our runtime system. The software handles the splitting, merging, and joining of the data streams between the modules by scheduling different partially reconfigurable accelerator modules on the fabric. This uses a library of operator accelerators in which operators can be implemented with different resource and performance tradeoffs to prevent over-provisioning of the FPGA resources. These operators can be created in either HLS or HDL, and as long as they use the system's API, our middleware can use partial reconfiguration to compose the acceleration pipeline and serve user requests. This PhD project is currently developing a demonstrator for database acceleration that can compile SQL queries into bitstreams. With this, we want to make better use of the FPGA resources in dynamic scenarios where the queries are only known at runtime. Moreover, our system will simplify the integration and operation of such systems by providing a standardized runtime system.
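The operator-library idea above can be sketched in software. The sketch below is purely illustrative: the operator names, resource costs, and the greedy variant selection are invented for this example and do not reflect the project's actual API.

```python
# Hypothetical sketch of runtime pipeline stitching: each operator exists in
# several variants with different resource/throughput tradeoffs; the runtime
# picks variants that fit a resource budget and chains them into one pipeline.

OPERATOR_LIBRARY = {
    "filter": [  # (resource units, function); second variant: faster but bigger
        (10, lambda rows: [r for r in rows if r["price"] > 100]),
        (25, lambda rows: [r for r in rows if r["price"] > 100]),
    ],
    "project": [
        (5, lambda rows: [{"price": r["price"]} for r in rows]),
    ],
}

def compose_pipeline(ops, budget):
    """Greedily pick the largest variant of each operator that still fits."""
    chosen, used = [], 0
    for op in ops:
        for cost, fn in sorted(OPERATOR_LIBRARY[op], key=lambda v: v[0], reverse=True):
            if used + cost <= budget:
                chosen.append(fn)
                used += cost
                break
        else:
            raise RuntimeError(f"no variant of {op!r} fits the budget")
    return chosen, used

def run(pipeline, rows):
    for stage in pipeline:   # stream rows through the stitched stages
        rows = stage(rows)
    return rows

pipeline, used = compose_pipeline(["filter", "project"], budget=20)
table = [{"id": 1, "price": 50}, {"id": 2, "price": 150}]
print(run(pipeline, table))  # [{'price': 150}]
```

In the real system the "variants" would be partially reconfigurable bitstream modules and the "budget" the available fabric area, but the selection problem has the same shape.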
Short bio
Kaspar Matas is a PhD student in the Advanced Processor Technologies Research Group (APT) at the Department of Computer Science, the University of Manchester. His PhD topic is “Transparent Integration of a Dynamic FPGA Database Acceleration System” where he works on a framework to dynamically combine and schedule database operations on a data streaming FPGA accelerator. He has contributed to multiple open-source projects, including FPGADefender and OrkhestraFPGAStream. His research interests include reconfigurable computing, hardware-software co-design, computer architecture, hardware security, and software engineering.
Distributed Large-Scale Graph-Processing on FPGAs
Speaker: Roberto Giorgi - University of Siena
Abstract: Processing large-scale graphs is a challenging task due to the nature of the computation, which causes irregular memory accesses. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs; thus, recent research trends propose graph processing acceleration with FPGAs. Moreover, in the case of large-scale graph processing, one major problem is that the graph does not fit into the limited amount of on-chip memory resources available on a modern FPGA. Due to the limited capacity of device memory, data must be transferred to and from the FPGA multiple times during the computation, which can lead to long execution times. To maximize performance, it is necessary to overlap, hide, and customize the data transfers to the highest degree, so that the FPGA accelerator is always fully loaded. A possible way to overcome the limited resources of a single FPGA accelerator is to develop a distributed architecture on a multi-FPGA platform using an efficient partitioning scheme. An efficient partitioning scheme aims to increase data locality, minimizing communication between the partitions. In this work, we use an offline partitioning method to support distributed large-scale graph processing. Our architecture uses Hadoop at the higher level to map a graph to the underlying hardware. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distributing them to a lower layer of computation made of FPGAs. In this work, we show how graph partitioning combined with a multi-FPGA architecture leads to high performance without limitation on the size of the graph, even when the graph has trillions of vertices.
Our performance analysis, in the case of PageRank, forecasts a performance improvement of up to 24 times and a cost-normalized improvement of up to 14.5 times when comparing the proposed approach on one Xilinx Alveo U250 FPGA accelerator against a state-of-the-art, highly optimized graph processing software implementation on a high-end CPU, such as a 32-core processor at 2.2 GHz.
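The block-wise processing model can be mimicked in a few lines of software. The sketch below runs PageRank over a toy graph one vertex partition at a time, in the spirit of blocks being streamed from the host layer to the FPGA layer; the graph, the partition, and all parameters are illustrative only.

```python
# Software sketch of partitioned PageRank: vertices are split into blocks
# (as an offline pre-processing step would do), and each iteration processes
# one block's outgoing edges at a time before combining the contributions.

def pagerank_partitioned(edges, n, parts, d=0.85, iters=50):
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        contrib = [0.0] * n
        for block in parts:            # process one partition at a time
            for src, dst in edges:
                if src in block:       # only edges owned by this block
                    contrib[dst] += rank[src] / out_deg[src]
        rank = [(1 - d) / n + d * c for c in contrib]
    return rank

# 4-vertex ring: every vertex links to the next one
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rank = pagerank_partitioned(edges, n=4, parts=[{0, 1}, {2, 3}])
print([round(r, 3) for r in rank])  # symmetric graph -> uniform ranks of 0.25
```

A good partitioner keeps most edges inside a block so that, on real hardware, cross-partition traffic (here, contributions to vertices in the other block) stays small.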
Ultra-low Power Computing with CGRAs: an architecture, compilation, and application triptych
Speaker: Kevin Martin - Université Bretagne-Sud
Abstract: Coarse-grained reconfigurable architectures (CGRAs) are ideal computing devices as they provide both flexibility and performance. In this talk, we present a three-part approach that addresses architecture, compilation, and application to reach ultra-low power computing with CGRAs. First, we present a general-purpose Integrated Programmable-Array accelerator (IPA) exploiting a novel architecture, execution model, and compilation flow for application mapping that can handle kernels containing complex control flow, without the significant energy overhead incurred by state-of-the-art approaches. We then present modifications applied at the application level to support transprecision computing, a variable-precision floating-point approach that adjusts the precision of the results to the application's needs. Finally, we present the SIMD and transprecision support in the CGRA. This global approach reaches an average 10x energy improvement compared to a RISC-V-based ultra-low-power digital signal processor.
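To give a feel for transprecision computing, the sketch below emulates reduced-precision floating point by truncating mantissa bits of an IEEE-754 double and observes how the error of a dot product grows as precision shrinks. This is a software model only, not the CGRA's actual arithmetic; the input vectors are arbitrary.

```python
# Illustrative transprecision model: keep only `keep_bits` of a double's
# 52 mantissa bits, so lower precision can be traded for (on real hardware)
# lower energy, at the cost of a larger result error.

import struct

def truncate_mantissa(x, keep_bits):
    """Zero out all but the top `keep_bits` of the 52-bit mantissa."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = ~((1 << (52 - keep_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

def dot(a, b, keep_bits):
    """Dot product with every product and accumulation truncated."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = truncate_mantissa(acc + truncate_mantissa(x * y, keep_bits), keep_bits)
    return acc

a = [0.1 * i for i in range(1, 9)]
b = [0.3 / i for i in range(1, 9)]
exact = dot(a, b, keep_bits=52)          # full double precision
for bits in (52, 23, 10):                # double, float-like, half-like mantissas
    print(bits, abs(dot(a, b, bits) - exact))
```

The application-level task in a transprecision flow is to pick, per kernel, the smallest precision whose error is still acceptable for the application.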
Short bio
Kevin Martin has been an associate professor at Université Bretagne-Sud, France, in the Lab-STICC laboratory, since 2011. He received an M.S. in electrical and computer engineering in 2004 and a PhD in computer science in 2010 from the Université de Rennes, France. His research interests include system-level design and methodologies, custom processors, embedded multi-processor platforms, CGRAs, high-level synthesis, and computer-aided design for SoCs and embedded systems.
The role of FPGAs for enabling onboard AI in space applications
Speaker: George Lentaris - National Technical University of Athens
Abstract: The success of AI/ML in terrestrial applications and the commercialization of space are now paving the way for the advent of AI/ML also in Low Earth Orbit satellites. The two most important hurdles in this direction are the reliability of AI algorithms and the processing power of classical space processors. To overcome the latter, the community considers extending the use of FPGAs in space, either with space-grade or Commercial-Off-The-Shelf devices. The FPGA capabilities can be complemented with VPU or TPU co-processors to further enhance high-level AI development and in-flight reconfiguration in space. Thus, selecting the most suitable FPGA devices and designing the most efficient avionics architecture becomes critical for the success of novel AI space missions. The current work presents industrial trends and future ideas, as well as in-house benchmarking and architectural designs utilizing FPGAs to enable AI in space applications.
Short bio
Dr George Lentaris is a senior research associate at Microlab, National Technical University of Athens (NTUA/Greece), working on high performance embedded computing. His work includes HW/SW co-design on single-/multi-/SoC-FPGA and DSP platforms to accelerate a variety of computer vision and telecommunication algorithms, as well as process variability and reliability of FPGA devices (including radiation testing). He holds a PhD in Computing from the National & Kapodistrian University of Athens/Greece ("Parallel Architectures and Algorithms for Digital Signal and Image Processing", NKUA, 2011), as well as two MSc degrees in "Logic, Algorithms, and Computation" and in "Electronic Automation", with a BSc in Physics. In the past decade, he has published >40 papers in the aforementioned domains and has participated in >10 European projects, both for the European Space Agency and for H2020. He also serves as a regular reviewer for international journals and as a teaching associate at NKUA and UNIWA, Greece.
DASS: An Automated HLS Tool that Combines Dynamic & Static Scheduling
Speaker: Jianyi Cheng - Imperial College London
Abstract: A central task in high-level synthesis is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile-time, but recent years have seen the emergence of dynamic scheduling, in which an operation's clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has non-trivial control flow. This talk introduces DASS, our open-source HLS tool that automatically combines the best of both worlds.
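A back-of-the-envelope model illustrates the tradeoff: suppose a loop body takes 2 cycles on its common path but 10 cycles on a rare data-dependent path. A static schedule must budget the worst case for every iteration, while a dynamic schedule pays the long latency only when it actually occurs. The numbers below are illustrative, not taken from DASS.

```python
# Cycle-count model of static vs dynamic scheduling for a loop whose body
# latency depends on the data: static II is bounded by the worst case,
# dynamic scheduling tracks the actual per-iteration latency.

def static_cycles(latencies):
    return len(latencies) * max(latencies)   # worst case budgeted every iteration

def dynamic_cycles(latencies):
    return sum(latencies)                    # pay only the actual latency

# 100 iterations; the slow 10-cycle path is taken 5% of the time
latencies = [10 if i % 20 == 0 else 2 for i in range(100)]
print(static_cycles(latencies))   # 1000
print(dynamic_cycles(latencies))  # 5*10 + 95*2 = 240
```

When the latency is data-independent, the two counts coincide, and the static schedule's simpler circuitry and resource sharing win, which is exactly why combining both kinds of scheduling in one design is attractive.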
Short bio
Jianyi Cheng is a Ph.D. candidate at Imperial College London, supervised by Prof. George A. Constantinides and Dr. John Wickerson. His research aims to produce smaller and faster hardware using formal methods. His current work is mainly focused on high-level synthesis (HLS) tool optimisation, including DASS, Dynamatic and Vivado HLS. His research interests include hardware programming, programming languages, static analysis, formal verification and probabilistic programming.
Optimizing Open Source Toolchain for FPGA bitstream generation
Speaker: Martin Margala - University of Louisiana at Lafayette
Abstract: The flexibility, high performance, and power efficiency of Field Programmable Gate Arrays (FPGAs) have resulted in their greater ubiquity in both cloud and edge environments. However, the existing state-of-the-art vendor tooling for FPGA bitstream generation lacks a number of features that are critical for high productivity, which in turn results in long turnaround times and substantially limits the manner in which FPGAs can be used. Since this tooling is also closed source, it cannot be modified to incorporate additional functionality. While there are a number of open-source alternatives, these tools currently only deliver a fraction of the hardware quality of vendor tool solutions, making their use impractical for most workloads. Our work aims at closing this gap between open-source and vendor toolchains for FPGA bitstream generation. This presentation will provide a project overview along with results on analyzing inherent biases of existing tools, building a synthetic benchmark set that can be used to identify and analyze policy decisions made by tools that impact generated hardware quality, and determining bottlenecks and suboptimal policies in open-source tools. Finally, the presentation will show initial results of optimizing the identified policies, which can be done manually or through reinforcement learning.
Short bio
Martin Margala is currently the Director of the School of Computing and Informatics at the University of Louisiana at Lafayette, a flagship of the University of Louisiana System. Previously, he was the Department Head of Electrical and Computer Engineering at the University of Massachusetts Lowell for 10 years. He has graduated 22 PhD students and 19 MS students and has published 250 papers in leading journals and peer-reviewed conferences. His research interests are energy-efficient reconfigurable secure architectures and sustainable computing design. He is a senior member of the ACM and IEEE.
A hybrid FPGA accelerator for MobileNets
Speaker: Fareed Mohammad Qararyah - Chalmers University of Technology
Abstract: Deep Learning is currently a key technology for a large variety of applications. While the trend shows that models are becoming larger and more complex, there are considerable efforts in producing compact and efficient models too. MobileNets fall into the latter category and are thus interesting for IoT applications. The optimizations at the model level have resulted in the use of operations that do not map well onto off-the-shelf generic AI accelerators. Consequently, there is a growing interest in exploring FPGA-based accelerators. The challenges are that dataflow implementations of the complete model are neither feasible nor flexible, while using compute kernels may result in large memory overheads. As such, in this work we propose a hybrid architecture that combines the benefits of compute kernels and dataflow. Our preliminary results for MobileNetV2 show a speedup potential of 1.5x when compared to a state-of-the-art compute-kernel-based accelerator.
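To see why MobileNets' operations differ from what generic accelerators expect, compare the multiply-accumulate counts of a standard convolution and the depthwise-separable form MobileNets use. The layer dimensions below are illustrative, not taken from the talk.

```python
# Depthwise-separable convolution replaces one dense KxK convolution with a
# per-channel KxK depthwise pass plus a 1x1 pointwise pass, cutting the
# multiply-accumulate (MAC) count dramatically -- but producing two thin
# operations that utilize a generic dense-conv accelerator poorly.

def standard_conv_macs(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

def separable_conv_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in   # one KxK filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_macs(h=56, w=56, k=3, c_in=64, c_out=128)
sep = separable_conv_macs(h=56, w=56, k=3, c_in=64, c_out=128)
print(round(std / sep, 1))  # 8.4, i.e. roughly 8x fewer MACs
```

The low arithmetic intensity of the depthwise pass is one reason a hybrid of dataflow stages and compute kernels can outperform a one-size-fits-all kernel engine.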
Short bio
Fareed Qararyah is a PhD student at Chalmers University of Technology. He is focusing on hardware-software co-design for developing efficient Deep Learning accelerators. He is currently working on VEDLIoT, an EU-funded project that aims at achieving very efficient Deep Learning across the IoT continuum. He received his Master's degree in High Performance Computing from Koc University.
DeepChip and its use of FPGAs for Embedded Machine Learning
Speaker: Holger Fröning - Heidelberg University
Abstract: The DeepChip project at Heidelberg University and its partners has meanwhile evolved from a single project into a set of projects under the umbrella of embedded machine learning. In this collection of research efforts, we are most concerned with the intersection of machine learning and hardware systems, in particular targeting inference on resource-constrained embedded systems. This talk will review the most important findings of the last couple of years, with a particular focus on FPGAs. The talk will conclude with a couple of anticipated research directions.
Short bio
Holger Fröning is a full professor and leads the Computing Systems Group at the Institute of Computer Engineering at Heidelberg University. His research interests focus on embedded machine learning and high-performance computing, and include hardware and software architectures, programmability, co-design, data movement optimizations, and associated power and energy aspects. Previously, he was an associate professor at the same university. In 2016, he was with NVIDIA Research (Santa Clara, CA, US) as a visiting scientist, sponsored by Bill Dally. In early 2015, he was a visiting professor at the Graz University of Technology (Austria), sponsored by Gernot Kubin. From 2008 to 2011 he reported to Jose Duato from the Technical University of Valencia (Spain). He received his PhD and MSc degrees in 2007 and 2001, respectively, from the University of Mannheim, Germany. In 2021, he was appointed visiting scientist at the Chinese Academy of Sciences. In 2014, he received the prestigious Google Faculty Research Award. Four of his publications have received a best paper award (IPDPS, ICPP, among others), and parts of his research results have been commercialized. He co-organizes the Workshop on Embedded Machine Learning (WEML) and the Workshop on IoT, Edge, and Mobile for Embedded Machine Learning (ITEM) on a regular basis. He is local co-chair for IEEE CLUSTER 2022, chaired tracks for EuroPar 2015 and International Supercomputer Conference 2017, and recently served on the program committees of IPDPS2021/19, CCGRID2020/19, SC2017, ICPP2022/21/20, FPL2022/21/20, and Euro-Par2019. He frequently provides reviews for established journals, such as IEEE Micro, TPDS, and JPDC. His recent sponsors include DFG, FWF, FFG, Carl-Zeiss Foundation, NVIDIA, SAP, and XILINX. For more information, visit his website: http://www.ziti.uni-heidelberg.de/compsys