Fast inference is of paramount value to a wide range of deep learning applications. To address the mismatch between architecture and hardware faced by traditional efforts, this work presents FTDL, a highly scalable FPGA overlay framework for deep learning applications. The FTDL overlay is specifically optimized for the tiled structure of FPGAs, thereby achieving post-place-and-route operating frequencies exceeding 88% of the theoretical maximum across different devices and design scales. A flexible compilation framework efficiently schedules the matrix multiply and convolution operations of large neural network inference on the overlay and achieves over 80% hardware efficiency on average. Taking advantage of both high operating frequency and high hardware efficiency, FTDL achieves 402.6 and 151.2 FPS with GoogLeNet and ResNet50 on ImageNet, respectively, while operating at a power efficiency of 27.6 GOPS/W. This corresponds to up to 7.7× higher performance and 1.9× better power efficiency than prior art.