Bitwise Neural Networks on FPGA

High-Speed and Low-Power

Yunfan Ye      Yayun Huang

Checkpoint

Figure 1. Illustration of the parallel algorithm on the FPGA

Progress Review

Baseline Implementation. As stated in the proposal, there is no readily available code for BNNs, so we implemented a baseline method based on the paper. Training a bitwise neural network takes two steps. First, we trained a real-valued network in TensorFlow that takes either bitwise inputs or real-valued inputs ranging between -1 and 1. Second, we binarized the network and used noisy backpropagation to update the weights.
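To make the second step concrete, below is a minimal NumPy sketch of the binarize-and-retrain idea: the forward pass uses sign(W) while gradients update the underlying real-valued weights (a straight-through-style update standing in for noisy backpropagation). The layer sizes, loss, and data here are illustrative assumptions, not our actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 64, 32, 16  # assumed sizes, for illustration only

W = rng.uniform(-1.0, 1.0, size=(n_in, n_out))   # real-valued weights from step one
x = rng.choice([-1.0, 1.0], size=(batch, n_in))  # bitwise inputs in {-1, +1}
target = rng.choice([-1.0, 1.0], size=(batch, n_out))

lr = 0.01
for step in range(100):
    Wb = np.sign(W)                          # binarize weights for the forward pass
    Wb[Wb == 0] = 1.0                        # break ties toward +1
    y = np.tanh(x @ Wb)                      # tanh as a soft stand-in for the sign activation
    grad_y = (y - target) / batch            # gradient of a squared-error loss
    grad_pre = grad_y * (1.0 - y ** 2)       # backpropagate through tanh
    grad_W = x.T @ grad_pre                  # straight-through: treat sign() as identity
    W = np.clip(W - lr * grad_W, -1.0, 1.0)  # update and keep weights in [-1, 1]
```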

FPGA implementation. With a fairly good pretrained BNN, we began to implement the design on the FPGA. Given the number of weights in the network, it is impossible to map all of them into logic gates. We therefore store all the binary weights in on-chip RAM and compute the results block by block. Specifically, we arrange the weights into blocks of 1024 bits. Each time, we read out 1024 weights, compute, and accumulate the intermediate results in counters. We then shift the input data by 1 bit, read the next block of weights, and repeat these steps 1024 times. Finally, if a counter has counted more 1s than 0s, we output 1; otherwise, we output 0. This module is shown in Figure 1, and it can be reused across layers to reduce resource consumption.
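As a software model of this dataflow, the Python sketch below processes one layer as XNOR-popcount-majority over 1024-bit rows, as we understand the design; the function names and the integers-as-bitvectors representation are our own illustration, not the VHDL.

```python
# BLOCK matches the 1024-bit block size described above.
BLOCK = 1024

def xnor_popcount_row(inp_bits: int, w_bits: int, width: int = BLOCK) -> int:
    """XNOR the input against one weight row and count the matching bits."""
    mask = (1 << width) - 1
    matches = ~(inp_bits ^ w_bits) & mask  # XNOR: 1 wherever the bits agree
    return bin(matches).count("1")

def bnn_layer(inp_bits: int, weight_rows: list[int], width: int = BLOCK) -> int:
    """Process one layer row by row, emitting 1 when 1s outnumber 0s (majority)."""
    out = 0
    for i, row in enumerate(weight_rows):
        ones = xnor_popcount_row(inp_bits, row, width)
        if ones > width - ones:  # the counter's majority decision
            out |= 1 << i
    return out

# Example: a random 1024-bit input against 8 random weight rows.
import random
random.seed(0)
inp = random.getrandbits(BLOCK)
rows = [random.getrandbits(BLOCK) for _ in range(8)]
print(format(bnn_layer(inp, rows), "08b"))
```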

Exhibits for the Parallelism Competition

Deep neural networks (DNNs) have substantially pushed the state of the art in a wide range of tasks, including speech recognition and computer vision. However, DNNs also require a wealth of compute resources and thus can benefit greatly from parallelization.

Moreover, power consumption has recently gained massive attention due to the emergence of mobile devices. As is well known, running real-time text detection and recognition on a standalone device, such as a pair of glasses, will quickly drain the battery. Therefore, we may need to exploit the heterogeneity of hardware to boost both performance and efficiency.

Remaining Issues

Currently, we are stuck on choosing a proper FPGA with enough on-chip storage (nearly 6MB) for our implementation.
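As a rough, hypothetical sanity check of where a figure of that magnitude could come from (the layer widths below are assumptions, not our actual network), three fully connected layers of width 4096 with 1-bit weights already need 3 × 4096² bits = 6 MiB:

```python
# Back-of-the-envelope check of the ~6MB on-chip storage figure;
# the layer widths are assumptions, not the actual network architecture.
layers = [4096, 4096, 4096, 4096]  # four assumed layer widths -> three weight matrices
bits = sum(a * b for a, b in zip(layers, layers[1:]))
print(bits / 8 / 2**20, "MiB of 1-bit weights")  # -> 6.0
```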

Deliverables

  • (Completed) Baseline implementation
    Since there was no readily available codebase, we implemented a baseline algorithm in TensorFlow (Caffe and Torch were the other candidates).
  • Evaluation
    We will evaluate our implementation by comparing its runtime against the baseline.
  • Hardware demo
    Hopefully, we can deliver a hardware demo on the FPGA. The ideal demo would consist of three parts: a camera to generate inputs, a segment display to show the results, and the FPGA itself.

Schedule

Apr 2016
Apr 9

(Completed) Get familiar with training bitwise neural networks and implement a correct baseline algorithm.

Apr 16

(Completed) Tune parameters and implement the retraining process to approach the results of the paper.

Apr 23

(Completed) Implement the classification phase in VHDL.

Apr 29

(Completed) Prepare for the final exam and analyze the feasibility of porting the code to hardware.

May 2016
May 6

Port the code to hardware and analyze the results [3].

May 8

Write final report and prepare for competition.