Large-scale matrix factorization with distributed stochastic gradient descent
TLDR
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements using an iterative stochastic optimization algorithm.
ClausIE: clause-based open information extraction
TLDR
We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text.
Distributed Matrix Completion
TLDR
We discuss parallel and distributed algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries.
Jaql: A Scripting Language for Large Scale Semistructured Data Analysis
TLDR
This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop’s MapReduce framework.
CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop
TLDR
In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes.
On synopses for distinct-value estimation under multiset operations
TLDR
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere.
Low-Latency Handshake Join
TLDR
This work revisits the processing of stream joins on modern hardware architectures and proposes a low-latency handshake join algorithm, which substantially reduces latency without sacrificing throughput or scalability.
MinIE: Minimizing Facts in Open Information Extraction
TLDR
The goal of Open Information Extraction (OIE) is to extract surface relations and their arguments from natural-language text in an unsupervised, domain-independent manner.
Fine-Grained Evaluation of Rule- and Embedding-Based Systems for Knowledge Graph Completion
TLDR
We present a fine-grained evaluation that gives insight into characteristics of the most popular datasets and points out the different strengths and shortcomings of the examined approaches.
Ricardo: integrating R and Hadoop
TLDR
This paper is about the development of industrial-strength systems that support advanced statistical analysis over huge amounts of data.