MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the…
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay…
MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze…
Database administrators of Online Transaction Processing (OLTP) systems constantly face difficult questions. For example, "What is the maximum throughput I can sustain with my current hardware?", "How much disk I/O will my system perform if the requests per second double?", or "What will happen if the ratio of transactions in my system changes?". Resource…
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present…
Efficient data storage is key to good performance in data management systems. However, data storage is currently either heavily constrained by static decisions (e.g., fixed data stores in DBMSs) or left to be tuned and configured by users (e.g., manual data backup in file systems). In this paper, we take a holistic view of data storage and…
In this paper, we present V, a graph analytics tool on top of a relational database, which is user friendly and yet highly efficient. Instead of constraining programmers to SQL, V offers a popular vertex-centric query interface, which is more natural for analysts to express many graph queries. The programmers simply provide their vertex-compute functions and…