A super real-time RNN framework with compressed structured block

Abstract

This paper presents CSB-RNN, an optimized full-stack RNN framework built on the novel compressed structured block (CSB) pruning technique. A CSB-pruned RNN model combines fine granularity, which benefits the pruning rate, with a regular structure, which facilitates hardware parallelism. We further propose a novel hardware architecture for inference on the CSB-pruned model that resolves the block-workload imbalance issue and achieves over 95% hardware utilization. CSB-RNN improves the pruning rate by 1.7×-3.6× compared with the prior art. Combined with the new architecture, compressed-RNN inference reaches a super real-time latency of 23 µs to 67 µs on an FPGA implementation.

Publication
2020 Boston Area Architecture Workshop
Hayden Kwok-Hay So
Associate Professor