This article presents the hardware design of the 16x16 2-D DCT used in the new video coding standard, High Efficiency Video Coding (HEVC). The transform stage is one of the innovations introduced by HEVC: it supports variable transform sizes (from 4x4 to 32x32), allowing larger transforms than those used in previous standards. The presented design exploits the separability property of the 2-D DCT, using two instances of the one-dimensional (1-D) DCT. The architecture targets low hardware cost and high throughput, so the HEVC 16-point DCT algorithm was simplified for a more efficient hardware implementation. These simplifications were achieved through operation and hardware minimization strategies: operation reordering, factoring, conversion of multiplications to shift-adds, and sharing of common sub-expressions. The 1-D DCT architectures were designed in a fully combinational way in order to reduce control overhead. A transposition buffer connects the two 1-D DCT architectures. The design was synthesized for Stratix III FPGA and TSMC 65 nm standard-cell technologies. The complete 2-D DCT architecture achieves real-time processing for high and ultra-high definition videos, such as Full HD, QFHD and UHD 8K. When compared with related works, the architectures designed in this work reached the highest throughput and the lowest hardware resource consumption.
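
As a minimal software sketch of two ideas named above, the Python code below illustrates (a) the separability property, where a 2-D transform is computed as a 1-D pass, a transposition, and a second 1-D pass, and (b) replacing a constant multiplication with shifts and additions. It is not the architecture described in this work: the floating-point orthonormal DCT-II matrix stands in for the HEVC integer matrix, the constant 90 is only an illustrative value, and all function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal floating-point DCT-II matrix (assumption: a stand-in,
    not the HEVC integer transform matrix)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def dct_1d_rows(block, c):
    """Apply the 1-D transform c to every row of block."""
    n = len(c)
    return [[sum(c[k][i] * row[i] for i in range(n)) for k in range(n)]
            for row in block]


def transpose(m):
    """Software analogue of a transposition buffer."""
    return [list(col) for col in zip(*m)]


def dct_2d_separable(block, c):
    """2-D transform as: 1-D pass, transpose, 1-D pass, transpose back."""
    stage1 = dct_1d_rows(block, c)
    stage2 = dct_1d_rows(transpose(stage1), c)
    return transpose(stage2)


def dct_2d_direct(block, c):
    """Reference 2-D transform computed directly from the double sum."""
    n = len(c)
    return [[sum(c[u][y] * c[v][x] * block[y][x]
                 for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]


def times_90(x):
    """Multiply an integer by 90 using only shifts and adds:
    90 = 64 + 16 + 8 + 2 (illustrative constant and decomposition)."""
    return (x << 6) + (x << 4) + (x << 3) + (x << 1)
```

For an N x N block, the separable form needs on the order of 2·N³ multiply-accumulate operations instead of N⁴ for the direct double sum, which is why two 1-D stages joined by a transposition step are attractive. The shift-add form removes the multiplier for a fixed constant; further savings from sharing common sub-expressions across several constants are not shown here.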