We are currently developing Willow, a shared-memory multiprocessor whose design provides system capacity and performance capable of supporting over a thousand commercial microprocessors. Most recently, we have focused our attention on the design of a sixty-four-processor prototype that tests most of our ideas about scalability. The design of such a multiprocessor poses a number of challenges to the computer architect. In this paper, we describe the factors that traditionally have limited the scalability of shared-memory systems. These include enforcing sequential consistency, inefficient synchronization, memory latency and bandwidth limitations, bus/memory contention, the necessity to enforce inclusion on lower-level caches, and limited I/O bandwidth. We then describe how the Willow architecture addresses each of these issues. Finally, we present data that evaluates the effect of the major architectural innovations in Willow on the performance of several parallel applications. These innovations include: a hierarchical memory, cache, synchronization, and I/O structure that exploits program locality at all levels in the hierarchy; support for adaptive cache coherence, whereby the coherence protocol used to manage each cache line is chosen based on the expected or observed access behavior for that line; the use of a relaxed cache consistency model and aggressive write buffering; and an efficient access-combining protocol within the cache hierarchy. Our data was obtained and confirmed using two different simulators: one a detailed hardware-level simulator, the other an execution-driven simulator whose accuracy was validated against the detailed simulator. The data related to I/O was obtained from an analytical model developed for that purpose.

This work was supported in part by the National Science Foundation under Grant CCR and by a National Science Foundation Graduate Fellowship.

