Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data(More)
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) Today’s programming languages/libraries have no explicit support for NUMA(More)
Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study that shows that many scientific(More)
Future exascale systems will be based on multi-core processors, but even today's multi-core processors can be asymmetric and exhibit limitations and bottlenecks that are different from those found on a symmetric multiprocessor. In this paper we investigate the performance of a cluster node based on the Intel Xeon E5345 quad-core processor and note that(More)
An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence(More)
Many multicore multiprocessors have a non-uniform memory architecture (NUMA), and for good performance, data and computations must be partitioned so that (ideally) all threads execute on the processor that holds their data. However, many multithreaded applications show heavy use of shared data structures that are accessed by all threads of the application.(More)
This paper presents the design and implementation of an embedded system for real-time network flow identification. The system identifies data flows based on packet inspection. The main advantage of this system is that it significantly reduces the processing time required for flow identification. For the hardware implementation, a Xilinx Virtex-II Pro(More)