Andreas Paepcke

Learn More
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are(More)
Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to(More)
We introduce five methods for summarizing parts of Web pages on handheld devices, such as personal digital assistants (PDAs), or cellular phones. Each Web page is broken into text units that can each be hidden, partially displayed, made fully visible, or summarized. The methods accomplish summarization by different means. One method extracts significant(More)
We developed two photo browsers for collections with thousands of time-stamped digital images. Modern digital cameras record photo shoot times, and semantically related photos tend to occur in bursts. Our browsers exploit the timing information to structure the collections and to automatically generate meaningful summaries. The browsers differ in how users(More)
Document sources are available everywhere, both within the internal networks of organizations and on the Internet. Even individual organizations use search engines from di erent vendors to index their internal document collections. These search engines are typically incompatible in that they support di erent query models and interfaces, they do not return(More)
In this paper, we study the problem of constructing and maintaining a large shared repository of web pages. We discuss the unique characteristics of such a repository, propose an architecture, and identify its functional modules. We focus on the storage manager module, and illustrate how traditional techniques for storage and indexing can be tailored to(More)
Through a study of field biology practices, we observed that biology fieldwork generates a wealth of heterogeneous information, requiring substantial labor to coordinate and distill. To manage this data, biologists leverage a diverse set of tools, organizing their effort in paper notebooks. These observations motivated ButterflyNet, a mobile capture and(More)
We describe PhotoCompas, a system that utilizes the time and location information embedded in digital photographs to automatically organize a personal photo collection PhotoCompas produces browseable location and event hierarchies for the collection. These hierarchies are created using algorithms that interleave time and location to produce an organization(More)
Given time and location information about digital photographs we can automatically generate an abundance of related contextual metadata, using off-the-shelf and Web-based data sources. Among these are the local daylight status and weather conditions at the time and place a photo was taken. This metadata has the potential of serving as memory cues and(More)
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present <i>SpotSigs</i>, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and(More)