We describe the design and implementation of the Cornell Cayuga System for scalable event processing. We present a query language based on Cayuga Algebra for naturally expressing complex event patterns. We also describe several novel system design and implementation issues, focusing on Cayuga's query processor, its indexing approach, how Cayuga handles …
We propose a demonstration of Cayuga, a complex event monitoring system for high-speed data streams. Our demonstration will show Cayuga applied to monitoring Web feeds; the demo will illustrate the expressiveness of the Cayuga query language, the scalability of its query processing engine to high stream rates, and a visualization of the internals of the …
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing …
Modern science is collecting massive amounts of data from sensors, instruments, and computer simulations. It is widely believed that analysis of this data will hold the key to future scientific breakthroughs. Unfortunately, deriving knowledge from large high-dimensional scientific datasets is difficult. One emerging answer is exploratory analysis …
We address a new learning problem where the goal is to build a predictive model that minimizes prediction time (the time taken to make a prediction) subject to a constraint on model accuracy. Our solution is a generic framework that leverages existing data mining algorithms without requiring any modifications to these algorithms. We show a first application …
Simulation is one of the most powerful tools that scientists have at their disposal for studying and understanding real-world physical phenomena. To be realistic, the mathematical models that drive simulations are often very complex and run for a very large number of simulation steps. The required computational resources often make it infeasible …
Archived web data is a great resource for scientific research, but it poses serious challenges in data processing and management. We demonstrate the Web Lab Collaboration Server, a platform and service for large-scale collaborative web data analysis in a distributed computing environment, and show how it seamlessly supports non-technical users during search, …
The formulation of hypotheses based on patterns found in data is an essential component of scientific discovery. As larger and richer data sets become available, new scalable and user-friendly tools for scientific discovery through data analysis are needed. We demonstrate Scolopax, which explores the idea of a search engine for hypotheses. It has an …