Learn More
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for(More)
  • Kristina Hamachi Lacommare, Joseph H Eto, Ernest Orlando Lawrence, Ernest Orlando, Sam Baldwin, John Boyes +19 others
  • 2004
The massive electric power blackout in the northeastern United States and Canada on August 14-15, 2003 resulted in the U.S. electricity system being called " antiquated " and catalyzed discussions about modernizing the grid. Industry sources suggested that investments of $50 to $100 billion would be needed. This report seeks to quantify an important piece(More)
Data is increasingly being bought and sold online, and Web-based marketplace services have emerged to facilitate these activities. However, current mechanisms for pricing data are very simple: buyers can choose only from a set of explicit views, each with a specific price. In this paper, we propose a framework for pricing data on the Internet that, given(More)
Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many(More)
We present an automatic skew mitigation approach for user-defined MapReduce programs and present SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent,(More)
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data(More)
In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data -- self-serve analytics. Using a web browser, Myria users can upload data,(More)
— As the datasets used to fuel modern scientific discovery grow increasingly large, they become increasingly difficult to manage using conventional software. Parallel database management systems (DBMSs) and massive-scale data processing systems such as MapReduce hold promise to address this challenge. However, since these systems have not been expressly(More)
We analyze Hadoop workloads from three di↵erent research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see(More)
Database technology remains underused in science, especially in the long tail the small labs and individual researchers that collectively produce the majority of scientic output. These researchers increasingly require iterative, ad hoc analysis over ad hoc databases but cannot individually invest in the computational and intellectual infrastructure required(More)