Barzan Mozafari

Learn More
In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To(More)
Mining frequent itemsets from data streams has proved to be very difficult because of computational complexity and the need for real-time response. In this paper, we introduce a novel verification algorithm which we then use to improve the performance of monitoring and mining tasks for association rules. Thus, we propose a frequent itemset mining method for(More)
A recent phenomenon on the Web is the emergence and proliferation of new social media systems allowing social interaction between people. One of the most popular of these systems is Wikipedia that allows users to create content in a collaborative way. Despite its current popularity, not much is known about how users interact with Wikipedia and how it has(More)
In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA(More)
Database administrators of Online Transaction Processing (OLTP) systems constantly face difficult questions. For example, "What is the maximum throughput I can sustain with my current hardware?", "How much disk I/O will my system perform if the requests per second double?", or "What will happen if the ratio of transactions in my system changes?". Resource(More)
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions.The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many(More)
Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are oŸen limited to small datasets (i.e., a few thousand items). is(More)
Much research attention has been given to delivering high-performance systems that are capable of <i>complex event processing</i> (CEP) in a wide range of applications. However, many current CEP systems focus on processing efficiently data having a simple structure, and are otherwise limited in their ability to support efficiently complex continuous queries(More)
Cloud computing is characterized by shared infrastructure and a decoupling between its operators and tenants. These two characteristics impose new challenges to databases applications hosted in the cloud, namely: (i) how to price database services, (ii) how to isolate database tenants, and (iii) how to optimize database performance on this shared(More)
Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, or sentiment analysis. However, due to the time and cost of human labor, solutions that solely rely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items).(More)