MapReduce and Spark are two very popular open-source cluster computing frameworks for large-scale data analytics. These frameworks hide the complexity of task parallelism and fault tolerance by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in the MapReduce and Spark frameworks, including shuffle, ...
Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity that developers gain conversely makes data management more complex, because there is no schema. In this paper, we present a schema management framework for document stores. This framework ...
Clipping Web pages, namely extracting the informative clips (areas) from Web pages, has many applications, such as Web printing and e-reading on small handheld devices. Although many existing methods attempt to address this task, most of them either work only on certain types of Web pages (e.g., news- and blog-like Web pages) or perform ...
Vast amounts of data are now produced continuously, for example by sensors, smart meters, application logs, and monitoring software. The data need to be processed in real time to yield actionable insights, so that services such as smart grid load balancing and cloud platform maintenance can be carried out efficiently. Stream processing is the ...
The diversity of applications makes performance evaluation of data-intensive large-scale systems very important. General test methods report the peak value as the final result without paying enough attention to resource utilization. However, recent studies have shown that the behavior of resources can reveal latent problems. In this ...
We describe a method to extract style and branding elements from multiple web pages on a given site for content repurposing. Style and branding elements convey the values of the site owners effectively and connect with the target prospects. They are manifested through logos, graphical elements, background color, font styles, font colors and other ...
Fork-join is a basic query processing model in shared-nothing parallel database systems. A query Q is decomposed into a number of sub-queries, each of which is processed independently on a processing element (PE); the results of all sub-queries are then "joined" and returned as Q's result. In this scheme, the query processing time of Q depends on the ...
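The decompose, process, and join steps described in the fork-join abstract can be illustrated with a minimal Python sketch. This is not the paper's system: the names `fork_join` and `even_numbers`, the range-based partitioning, and the use of a thread pool to stand in for processing elements are all assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fork_join(query_range, num_pes, sub_query):
    """Split a range query into sub-queries, run each independently, join results.

    query_range: (lo, hi) half-open integer range covered by the query Q.
    num_pes:     number of worker threads standing in for processing elements.
    sub_query:   hypothetical function that answers Q on one sub-range.
    """
    lo, hi = query_range
    if hi <= lo:
        return []
    # Fork: partition the range into at most num_pes contiguous sub-ranges.
    step = -(-(hi - lo) // num_pes)  # ceiling division
    parts = [(start, min(start + step, hi)) for start in range(lo, hi, step)]
    # Process: each sub-query runs independently of the others.
    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        partials = list(pool.map(sub_query, parts))  # map preserves input order
    # Join: concatenate the partial results into the result of Q.
    return [row for partial in partials for row in partial]


def even_numbers(part):
    """Toy sub-query: return the even integers in one sub-range."""
    start, end = part
    return [x for x in range(start, end) if x % 2 == 0]


result = fork_join((0, 20), 4, even_numbers)
```

Here `result` holds the even integers from 0 to 18, assembled from four sub-ranges that were each answered separately. A real shared-nothing system would ship each sub-query to a separate node rather than a thread.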