Bitwise Neural Networks on FPGA

High-Speed and Low-Power

Yunfan Ye      Yayun Huang

Proposal

Summary

We implemented bitwise neural networks on FPGA and ran tests on the MNIST dataset. Experiments show that we achieve a 4x speedup compared with the state-of-the-art FPGA implementation.

Background

Deep neural networks (DNNs) have substantially pushed the state of the art in a wide range of tasks, including speech recognition and computer vision. However, DNNs also require substantial compute resources, and thus can benefit greatly from parallelization.

Moreover, power consumption has recently gained massive attention due to the emergence of mobile devices. As is well known, running real-time text detection and recognition tasks on standalone devices, such as a pair of smart glasses, will quickly drain the battery. Therefore, we need to exploit the heterogeneity of hardware to boost both performance and efficiency.

Attempts to implement neural networks on FPGAs date back 24 years [1], yet no design showed potential for commercial use until recently [2]. However, the recent design still suffers from high resource usage and high latency. In this work, we propose a resource-efficient implementation that also achieves higher throughput and lower latency.

Platform

Bitwise neural networks (BNNs) can further drive down power consumption by eliminating power-hungry operations like multiplication [1]. Hence, we expect that a well-implemented BNN on heterogeneous hardware can enable a real-time, always-on service. Specifically, a field-programmable gate array (FPGA) is reconfigurable hardware that is particularly well suited to fast bit operations, so we will first try to port the algorithm to an FPGA.
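
To make the idea concrete, the sketch below (in Python, purely for illustration and not taken from the paper or from our VHDL code) shows how a dot product over {-1, +1} vectors reduces to an XNOR followed by a popcount, with no multiplications; the bit-packing convention (bit 1 for +1, bit 0 for -1) is our assumption.

    # A minimal Python sketch (illustrative only, not the paper's reference code)
    # of how a bitwise dot product replaces multiply-accumulate in a BNN layer.
    # Assumed convention: activations and weights are in {-1, +1} and packed
    # into Python integers, one bit per element (bit 1 -> +1, bit 0 -> -1).

    def pack_bits(values):
        """Pack a list of +/-1 values into an integer bit mask."""
        word = 0
        for i, v in enumerate(values):
            if v > 0:
                word |= 1 << i
        return word

    def bitwise_dot(x_bits, w_bits, n):
        """Dot product of two length-n {-1, +1} vectors via XNOR + popcount."""
        xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # 1 where the signs match
        matches = bin(xnor).count("1")              # popcount
        return 2 * matches - n                      # +1 per match, -1 per mismatch

    # Example: [-1, +1, +1, -1] . [+1, +1, -1, -1] = -1 + 1 - 1 + 1 = 0
    x = pack_bits([-1, +1, +1, -1])
    w = pack_bits([+1, +1, -1, -1])
    assert bitwise_dot(x, w, 4) == 0

On an FPGA, the XNOR and the popcount map naturally onto LUTs and a small adder tree, which is what makes this formulation attractive for low-power hardware.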

Challenge

  • Neural networks normally update their weights using back-propagation. Thus, dependencies exist between layers, and we need to design a proper work assignment within each layer.
  • Neural networks consist of a massive number of weights, and the training process involves many iterations. Thus, data locality should be a central concern.
  • BNNs expect binary inputs, so we need to preprocess the raw images, also in parallel (a rough preprocessing sketch follows this list).
  • We need to devise a good pipelining strategy to maximize the throughput.
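
As a concrete illustration of the preprocessing challenge above, the following is a minimal NumPy sketch that binarizes MNIST images with a fixed threshold; the threshold of 128 is an assumption made for illustration, not necessarily the scheme used in the paper or in our final implementation.

    # A rough sketch of the input preprocessing step: MNIST pixels are 8-bit
    # grayscale values (0-255), so they must be mapped to the {-1, +1} inputs
    # a BNN expects. The fixed threshold of 128 is an illustrative assumption.
    import numpy as np

    def binarize_images(images, threshold=128):
        """Map uint8 MNIST images of shape (batch, 28, 28) to {-1, +1}."""
        return np.where(images >= threshold, 1, -1).astype(np.int8)

    # Each pixel is handled independently, so this step parallelizes trivially
    # across pixels and across images in a batch.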

Resources

  • Codebase
    To the best of our knowledge, there is no readily available code for BNNs. Therefore, we may start from the algorithm described in the paper [1].
  • High-performance Hardware
    The training phase of the neural networks requires very powerful compute resources, and thus we can exploit the latedays machines.
  • Reconfigurable Hardware
    BNNs can greatly benefit from hardware acceleration. Ideally, we can implement the classifying phase on reconfigurable hardware, such as an FPGA.

Deliverables

  • (Completed) Baseline implementation
    Since there is no readily available codebase, we will implement a baseline algorithm in an open-source framework such as Caffe, Torch, or TensorFlow (a rough sketch of such a baseline follows this list).
  • (Completed) Evaluation
    We will evaluate our implementation by comparing its runtime against the baseline.
  • Hardware demo
    Hopefully, we can deliver a hardware demo on an FPGA. The ideal demo would consist of three parts: a camera to generate inputs, a segment display to show the results, and the FPGA itself.
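
For the baseline deliverable, the forward (classifying) pass of a bitwise network can be prototyped in a few lines of NumPy before any hardware work; the sketch below is illustrative only, and the 784-256-10 layer sizes, random weights, and helper names are our assumptions, not the actual trained model.

    # An illustrative NumPy sketch of a baseline classifying phase: a fully
    # connected network whose weights and hidden activations are constrained
    # to {-1, +1} via sign(). Layer sizes and weights here are placeholders.
    import numpy as np

    def sign(x):
        """Binarize to {-1, +1}; zeros map to +1 so values stay bipolar."""
        return np.where(x >= 0, 1, -1)

    def bnn_forward(x, weights):
        """Hidden layers use a +/-1 matrix-vector product followed by sign();
        the last layer's raw scores are kept so argmax stays meaningful.
        On the FPGA, each matrix-vector product becomes XNOR + popcount."""
        a = sign(x)
        for W in weights[:-1]:
            a = sign(W @ a)
        return weights[-1] @ a

    # Toy usage on a stand-in for a flattened 28x28 MNIST digit.
    rng = np.random.default_rng(0)
    weights = [sign(rng.standard_normal((256, 784))),
               sign(rng.standard_normal((10, 256)))]
    image = rng.integers(0, 256, size=784)
    scores = bnn_forward(np.where(image >= 128, 1, -1), weights)
    predicted_digit = int(np.argmax(scores))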

Schedule

Apr 2016

  • Apr 9: (Completed) Get familiar with training bitwise neural networks and implement a correct baseline algorithm.
  • Apr 16: (Completed) Tune parameters and implement the retraining process to approach the results reported in the paper.
  • Apr 23: (Completed) Implement the classifying phase in VHDL.
  • Apr 29: (Completed) Prepare for the final exam and analyze the feasibility of porting the code to hardware.

May 2016

  • May 6: (Completed) Port the code to hardware and analyze the results [3].
  • May 8: (Completed) Write the final report and prepare for the competition.

References

  • [1] Minje Kim and Paris Smaragdis. Bitwise Neural Networks. arXiv preprint arXiv:1601.06071v1, 2016.
  • [2] Stuart Byma. Programming Datacenter-Scale Reconfigurable Systems. Accessed April 1st.
  • [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385v1, 2015.