Matin Hashemi

Learn More
Many mobile applications running on smartphones and wearable devices would potentially benefit from the accuracy and scalability of deep CNN-based machine learning algorithms. However, performance and energy consumption limitations make the execution of such computationally intensive algorithms on mobile devices prohibitive. We present a GPU-accelerated(More)
We present a framework for development of streaming applications as concurrent software modules running on multi-processors system-on-chips (MPSoC). We propose an iterative design space exploration mechanism to customize MPSoC architecture for given applications. Central to the exploration engine is our system-level performance estimation methodology, that(More)
Heterogeneous soft multiprocessor systems are likely to find a larger share in the application-specific computing market due to increasing cost and defect rates in foreseeable manufacturing technologies. We study the problem of mapping streaming applications onto heterogeneous soft dual-processor systems, in which processors' limited memory resources and(More)
Many embedded applications demand processing of a seemingly endless stream of input data in real-time. Productive development of such applications is typically carried out by synthesizing software from high-level specifications, such as data-flow graphs. In this context, we study the problem of inter-actor buffer allocation, which is a critical step during(More)
Variants of dataflow specification models are widely used to synthesize streaming applications for distributed-memory parallel processors. We argue that current practice of specifying streaming applications using <i>rigid</i> dataflow models, implicitly prohibits a number of platform oriented optimizations and hence limits portability and scalability with(More)
We present a methodology for pipelined software synthesis of streaming applications. First, we develop a versatile task assignment algorithm capable of optimizing realistically-arbitrary cost functions for two cores. The algorithm is exact (i.e., theoretically optimal) contrary to existing heuristics. Second, our approximation technique provides an(More)
We study the problem of mapping concurrent tasks of an application to cores of a chip multiprocessor that utilize circuit-switched interconnect and global asynchronous local synchronous (GALS) clocking domains. We develop a configurable algorithm that naturally handles a number of practical requirements, such as architectural features of the target(More)
We study the trade-off between throughput and memory footprint of embedded software that is synthesized from acyclic static dataflow (task graph) specifications targeting distributed memory multiprocessors. We identify iteration overlapping as a knob in the synthesis process by which one can trade application throughput for its memory requirement. Given an(More)
Streaming applications, which are abundant in many disciplines such as multimedia, networking, and signal processing, require efficient processing of a seemingly infinite sequence of input data. In the context of streaming software synthesis from data flow graphs, we study the inherent trade-off between memory requirement and compilation runtime, under a(More)