This work presents experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory (DSM) system, which lets processes assume a globally shared virtual memory even though they execute on nodes that do not physically share memory.
A performance evaluation of TreadMarks running on Ultrix on DECstation-5000/240s connected by a 100-Mbps switch-based ATM LAN and a 10-Mbps Ethernet supports the contention that, with suitable networking technology, DSM is a viable technique for parallel computation on clusters of workstations.
Lazy release consistency is a new algorithm for implementing release consistency that lazily pulls modifications across the interconnect only when necessary, which reduces both the number of messages and the amount of data transferred between processors.
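To make the contrast with eager release consistency concrete, the following sketch (plain C; the types and helpers such as page_id and record_write are hypothetical, not TreadMarks code, and real implementations also track intervals with vector timestamps) models the lazy protocol for a single lock: the releaser only records write notices locally, and a later acquirer pulls those notices and invalidates the named pages, so no messages are sent at release time and page contents move only if they are actually touched.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NOTICES 64
#define N_PAGES     1024

typedef unsigned long page_id;

struct lock {
    page_id notices[MAX_NOTICES];  /* pages dirtied before the last release */
    size_t  n_notices;
};

struct processor {
    page_id dirty[MAX_NOTICES];    /* pages this processor wrote in the current interval */
    size_t  n_dirty;
    bool    valid[N_PAGES];        /* validity of the local copy of each page */
};

/* Called on the first write to a page between acquire and release. */
void record_write(struct processor *p, page_id pg)
{
    if (p->n_dirty < MAX_NOTICES)
        p->dirty[p->n_dirty++] = pg;
}

/* Eager release consistency would broadcast invalidations or updates here.
 * The lazy protocol merely attaches the releaser's write notices to the lock;
 * nothing crosses the interconnect yet. */
void lock_release(struct processor *p, struct lock *l)
{
    for (size_t i = 0; i < p->n_dirty && l->n_notices < MAX_NOTICES; i++)
        l->notices[l->n_notices++] = p->dirty[i];
    p->n_dirty = 0;
}

/* The next acquirer pulls the write notices and invalidates its stale copies.
 * Up-to-date contents (diffs) are fetched later, on demand, only for pages the
 * acquirer actually touches, which is what cuts both messages and data volume. */
void lock_acquire(struct processor *p, struct lock *l)
{
    for (size_t i = 0; i < l->n_notices; i++)
        p->valid[l->notices[i] % N_PAGES] = false;
}
```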
This paper describes three benchmarks for evaluating the performance of Web sites with dynamic content, and implements them using a variety of methods for building dynamic-content applications, including PHP, Java servlets, and Enterprise JavaBeans (EJB).
The performance of HDFS is analyzed and several performance issues are uncovered, including architectural bottlenecks in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks.
This paper proposes Maestro, which preserves a simple programming model for programmers while exploiting parallelism throughout the system together with additional throughput optimizations, and experimentally shows that Maestro's throughput achieves near-linear scalability on an eight-core server machine.
This paper is the first to study the impact of the VMM scheduler on performance when multiple guest domains concurrently run different types of applications; it offers insight into the key problems in VMM scheduling for I/O and motivates future innovation in this area.
It is shown that any of the five MMU cache structures reduces DRAM accesses for radix-tree page table walks far below the number required by an inverted page table, and that the most effective MMU caches are translation caches, which store partial translations and allow the page walk hardware to skip one or more levels of the page table.
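The sketch below (C; the structures, the read_pte stand-in for a DRAM access, and the single-entry-per-set cache are hypothetical simplifications of an x86-64-style four-level walk, ignoring permissions, large pages, and the TLB) illustrates why a translation cache helps: by tagging the upper virtual-address bits and caching the physical address of the leaf page table they lead to, a hit lets the walker skip the top three levels and make one memory reference instead of four.

```c
#include <stdint.h>
#include <stdbool.h>

#define LEVELS     4
#define IDX_BITS   9
#define PAGE_SHIFT 12

/* Stand-ins supplied by a hypothetical memory model: one call to read_pte()
 * represents one DRAM access to a page table entry. */
extern uint64_t read_pte(uint64_t table_pa, unsigned index);
extern uint64_t root_table_pa;                       /* CR3 equivalent */

/* Direct-mapped translation cache: tags the top 27 VA bits (the indices of
 * the three upper levels) and caches the physical address of the leaf
 * (lowest-level) page table they lead to. */
#define TC_ENTRIES 64
struct tc_entry { bool valid; uint64_t tag; uint64_t leaf_table_pa; };
static struct tc_entry tcache[TC_ENTRIES];

static unsigned va_index(uint64_t va, int level)     /* level 3..0, 3 = root */
{
    return (va >> (PAGE_SHIFT + IDX_BITS * level)) & ((1u << IDX_BITS) - 1);
}

uint64_t walk(uint64_t va, unsigned *dram_accesses)
{
    uint64_t upper = va >> (PAGE_SHIFT + IDX_BITS);  /* indices of levels 3..1 */
    struct tc_entry *e = &tcache[upper % TC_ENTRIES];
    uint64_t table = root_table_pa;
    int level = LEVELS - 1;

    if (e->valid && e->tag == upper) {               /* hit: skip three levels */
        table = e->leaf_table_pa;
        level = 0;
    }
    for (; level >= 0; level--) {
        uint64_t pte = read_pte(table, va_index(va, level));
        (*dram_accesses)++;
        if (level == 1) {                            /* remember the leaf table */
            e->valid = true;
            e->tag = upper;
            e->leaf_table_pa = pte & ~0xfffULL;
        }
        if (level == 0)                              /* final translation */
            return (pte & ~0xfffULL) | (va & 0xfffULL);
        table = pte & ~0xfffULL;                     /* descend to next level */
    }
    return 0;                                        /* not reached */
}
```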
The overall impact of these optimizations is a factor of 4.4 improvement in the transmit performance of guest domains, together with support for guest operating systems to effectively use advanced virtual memory features such as superpages and global page mappings.
This work describes a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving it, and reports that the adaptive protocol can almost halve the number of inter-node messages on some applications.
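As a rough illustration of one such heuristic (a C sketch of directory-side logic; the structure, fields, and exact classification rule are illustrative assumptions, not the paper's protocol verbatim): a block is treated as migratory when write permission keeps being requested by a processor other than the last writer while only the last writer's copy remains, and once classified, a read miss is served with an exclusive copy and the previous owner is invalidated in the same transaction, so the subsequent write needs no separate invalidation round trip.

```c
#include <stdbool.h>

#define MAX_PROCS 64

/* Per-block directory state; all names are illustrative. */
struct dir_entry {
    bool present[MAX_PROCS];   /* which caches currently hold a copy */
    int  n_copies;
    int  last_writer;          /* processor that performed the last write, -1 if none */
    bool migratory;            /* current classification of the block */
};

/* Called when processor p requests write permission.  The block looks
 * migratory if exactly two copies exist and the other copy belongs to the
 * previous writer: a read-modify-write is hopping from processor to
 * processor. */
void on_write_request(struct dir_entry *d, int p)
{
    d->migratory = (d->n_copies == 2 &&
                    d->last_writer >= 0 &&
                    d->last_writer != p &&
                    d->present[d->last_writer]);
    d->last_writer = p;
}

/* Called on a read miss by processor p.  For a migratory block the directory
 * hands out an exclusive copy and invalidates the previous owner immediately,
 * folding the later invalidation into this transaction; otherwise it grants
 * an ordinary shared copy.  Returns true if exclusive access was granted. */
bool on_read_miss(struct dir_entry *d, int p)
{
    if (d->migratory && d->last_writer >= 0 && d->last_writer != p &&
        d->present[d->last_writer]) {
        d->present[d->last_writer] = false;
        d->n_copies--;
    }
    d->present[p] = true;
    d->n_copies++;
    return d->migratory;
}
```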