The proposed contention-aware scheduling preserves the theoretical basis of task scheduling, and classic list scheduling is shown to extend easily to this more accurate system model.
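As a rough illustration of how list scheduling can be extended with contention, the sketch below places tasks in a given priority order on the processor that allows the earliest start. It assumes a single shared bus that carries one message at a time, which is a deliberate simplification and not the network model of the paper. The names `list_schedule`, `makespan`, and the example task graph are hypothetical.

```python
def list_schedule(order, duration, preds, comm, num_procs, contention=True):
    """Greedy list scheduling; `order` must respect precedence constraints.

    Assumption: all remote messages share one bus that serves them one
    at a time when `contention` is True. Returns task -> (proc, start, finish).
    """
    proc_free = [0.0] * num_procs   # time at which each processor becomes idle
    bus_free = 0.0                  # time at which the shared bus becomes idle
    placed = {}
    for task in order:
        best = None                 # (start, processor, resulting bus_free)
        for p in range(num_procs):
            ready, bus = 0.0, bus_free
            for pred in preds.get(task, []):
                pred_proc, _, pred_finish = placed[pred]
                if pred_proc == p:
                    arrival = pred_finish            # local data, no transfer
                elif contention:
                    begin = max(pred_finish, bus)    # wait for the bus
                    bus = arrival = begin + comm[(pred, task)]
                else:
                    arrival = pred_finish + comm[(pred, task)]
                ready = max(ready, arrival)
            start = max(ready, proc_free[p])
            if best is None or start < best[0]:
                best = (start, p, bus)
        start, p, bus_free = best
        placed[task] = (p, start, start + duration[task])
        proc_free[p] = start + duration[task]
    return placed


def makespan(schedule):
    """Finish time of the last task in the schedule."""
    return max(finish for _, _, finish in schedule.values())


# Hypothetical fork graph: A sends data to B, C and D.
duration = {"A": 1, "B": 5, "C": 5, "D": 5}
preds = {"B": ["A"], "C": ["A"], "D": ["A"]}
comm = {("A", "B"): 2, ("A", "C"): 2, ("A", "D"): 2}
order = ["A", "B", "C", "D"]

ideal = list_schedule(order, duration, preds, comm, 3, contention=False)
aware = list_schedule(order, duration, preds, comm, 3, contention=True)
```

On this example the contention-free schedule has a makespan of 8, while the bus-aware one has a makespan of 10, because the second remote message has to wait for the first to leave the bus.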
A set of new techniques to improve the implementation of the SHA-2 hashing algorithm is proposed, consisting mostly of operation rescheduling and hardware reutilization, which allow a significant reduction of the critical path while also decreasing the required area.
Algorithms and data structures suitable for parallel computing are proposed in this paper to perform LDPC decoding on multicore architectures and achieve throughputs that in some cases closely approach those obtained with VLSI decoders.
An alternative flexible LDPC decoder is proposed that exploits data-parallelism for simultaneous multicodeword decoding, supported by multithreading on CUDA-based graphics processing units (GPUs); it achieves throughputs above 100 Mbps and BER curves that compare well with ASIC solutions.
This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization.
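For reference, the original Roofline bound is the minimum of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The sketch below evaluates that bound once per memory level, which is only one plausible way to picture cache-awareness and is not taken from the paper. The peak and bandwidth figures are invented for illustration, and the names `attainable_gflops` and `ceilings` are hypothetical.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak, bandwidth * arithmetic intensity).

    `ai` is in flops per byte, `bandwidth_gbs` in GB/s.
    """
    return min(peak_gflops, bandwidth_gbs * ai)


# Illustrative machine parameters (assumed, not measured).
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = {"L1": 400.0, "L2": 200.0, "DRAM": 25.0}


def ceilings(ai):
    """One roofline ceiling per memory level for a given arithmetic intensity."""
    return {
        level: attainable_gflops(ai, PEAK_GFLOPS, bw)
        for level, bw in BANDWIDTH_GBS.items()
    }
```

With these assumed numbers, `ceilings(0.5)` gives 100.0 for L1 and L2 (compute bound) and 12.5 for DRAM (memory bound), showing how the same kernel can sit under different ceilings depending on where its data resides.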
This paper investigates the involvement of the processor in communication and its impact on task scheduling, and proposes a new system model, based on the contention model, that is aware of the processor involvement.
Comparisons to state-of-the-art AES cores indicate that the proposed unfolded core outperforms the most recent works by 34% in throughput and requires 68% less reconfigurable area.
Experimental results show that the proposed ASIP architecture is able to estimate motion vectors in real time for QCIF and CIF video sequences with very low power consumption and is also able to adapt its operation to the available energy level at runtime.
The proposed architecture is comprehensive, providing modulo (2^n+1) multipliers with similar performance and cost both for the ordinary and for the diminished-1 number representations, and is the only one taking advantage of this recoding to obtain faster multipliers with a significant reduction in hardware.
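To make the diminished-1 representation concrete: a nonzero residue a modulo 2^n+1 is stored as a-1, which fits in n bits, and zero is handled separately. The sketch below is a plain software reference model of modulo (2^n+1) multiplication in that representation; it says nothing about the recoding or the hardware structure of the proposed multipliers. Using `None` as the zero indicator and the function names are assumptions made for illustration.

```python
def to_dim1(a, n):
    """Encode a residue modulo 2**n + 1; zero is flagged as None."""
    a %= (1 << n) + 1
    return None if a == 0 else a - 1


def from_dim1(d):
    """Decode a diminished-1 value back to the ordinary residue."""
    return 0 if d is None else d + 1


def mul_dim1(x, y, n):
    """Multiply two diminished-1 operands modulo 2**n + 1.

    Uses (x + 1) * (y + 1) = x*y + x + y + 1, so the ordinary product is
    formed directly from the diminished-1 operands.
    """
    if x is None or y is None:
        return None                      # zero operand gives a zero product
    modulus = (1 << n) + 1
    product = (x * y + x + y + 1) % modulus
    # 2**n + 1 need not be prime, so nonzero operands can multiply to zero.
    return None if product == 0 else product - 1
```

For example, with n = 4 (modulus 17), the operands 5 and 7 are stored as 4 and 6, and `from_dim1(mul_dim1(4, 6, 4))` returns 1, since 35 mod 17 = 1.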