Fast inference is of paramount value to a wide range of deep learning applications. To address the mismatch between neural network architectures and hardware faced by traditional efforts, this work presents FTDL, a highly scalable FPGA overlay framework for deep learning applications. The FTDL overlay is specifically optimized for the tiled structure of FPGAs, achieving post-place-and-route operating frequencies exceeding 88% of the theoretical maximum across different devices and design scales. A flexible compilation framework efficiently schedules the matrix-multiply and convolution operations of large neural network inference onto the overlay, achieving over 80% hardware efficiency on average. Taking advantage of both the high operating frequency and hardware efficiency, FTDL achieves 402.6 and 151.2 FPS on ImageNet with GoogLeNet and ResNet-50, respectively, while operating at a power efficiency of 27.6 GOPS/W, delivering up to 7.7× higher performance and 1.9× better power efficiency than prior art.