Ming-Chuan Wu

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new ...
Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generate these plans. Yet, these properties are difficult to estimate due to the highly distributed nature of these frameworks, the freedom that allows users to specify arbitrary code as operations ...
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been ...
Data warehousing is a booming industry with many interesting research problems. The database research community has concentrated on only a few aspects. In this paper, we summarize the state of the art, suggest architectural extensions, and identify research problems in the areas of warehouse modeling and design, data cleansing and loading, data refreshing and ...
Bitmaps are popular indexes for data warehouse (DW) applications and most database management systems offer them today. This paper proposes query optimization strategies for selections using bitmaps. Both continuous and discrete selection criteria are considered. Query optimization strategies are categorized into static and ...
This paper presents an architecture overview of the distributed, heterogeneous query processor (DHQP) in the Microsoft SQL Server database system to enable queries over a large collection of diverse data sources. The paper highlights three salient aspects of the architecture. First, the system introduces well-defined abstractions such as connections, ...
An increasing number of applications require distributed data storage and processing infrastructure over large clusters of commodity hardware for critical business decisions. The MapReduce programming model [2] helps programmers write distributed applications on large clusters, but requires dealing with complex implementation details (e.g., reasoning with ...