<i>Batched stream processing</i> is a new distributed data processing paradigm that models recurring batch computations on incrementally bulk-appended data streams. The model is inspired by our empirical study on a trace from a large-scale production data-processing cluster; it allows a set of effective query optimizations that are not possible in a ...
As the study of large graphs over hundreds of gigabytes becomes increasingly popular for various data-intensive applications in cloud computing, developing large graph processing systems has become a hot and fruitful research area. Many of those existing systems support a <i>vertex-oriented</i> execution model and allow users to develop custom logic on ...
Mining retrospective events from text streams has been an important research topic. The classic text representation model (i.e., the vector space model) cannot model the temporal aspects of documents. To address this, we proposed a novel burst-based text representation model, denoted as BurstVSM. BurstVSM maps dimensions to bursty features instead of terms, which ...
Cloud computing allows users to perform computation in a public cloud with a pricing scheme typically based on incurred resource consumption. While cloud computing is often considered merely a new application for classic distributed systems, we argue that, by decoupling users from cloud providers with a pricing scheme as the bridge, cloud computing has ...
Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as “black boxes”, we propose to analyze those functions to turn them into “gray boxes” that expose ...
As the study of large graphs over hundreds of gigabytes becomes increasingly popular in cloud computing, the efficiency and programmability of large graph processing tasks challenge existing tools. The inherent random access pattern on the graph generates a significant amount of network traffic. Moreover, implementing custom logic on the unstructured data in a ...
We have designed and implemented the Tianwang File System (TFS), a distributed file system much like the Google File System (GFS). The system has its origins in our Tianwang search engine and web mining research work. Our system makes the same assumptions and has the same architecture as GFS. But the key design choice that the chunk size is variable lets our ...
To minimize the amount of data-shuffling I/O that occurs between the pipeline stages of a distributed data-parallel program, its procedural code must be optimized with full awareness of the pipeline that it executes in. Unfortunately, neither pipeline optimizers nor traditional compilers examine both the pipeline and procedural code of a data-parallel ...
We introduce the new Wave model for exposing the temporal relationship among the queries in data-intensive distributed computing. The model defines the notion of query series to capture the recurrent nature of batched computation on periodically updated input streams. This seemingly simple concept captures a significant portion of the queries we observed in ...