Semi-Markov Conditional Random Fields for Information Extraction
Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments rather than to individual elements x_i of x, and transitions within a segment can be non-Markovian.
Discriminative Methods for Multi-labeled Classification
Presents a new technique for combining text features with features indicating relationships between classes; it can be used with any discriminative algorithm and outperforms existing methods in accuracy with statistically significant improvements.
Interactive deduplication using active learning
This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
Modeling multidimensional databases
Proposes a data model and a few algebraic operations that provide a semantic foundation for multidimensional databases, along with an algebraic application programming interface (API) that allows the front end to be separated from the back end.
Annotating and searching web tables using entities, types and relationships
This paper proposes new machine learning techniques to annotate table cells with the entities they likely mention, table columns with the types from which the entities in that column are drawn, and pairs of table columns with the relations they seek to express, along with a new graphical model that makes all these labeling decisions for each table simultaneously.
Efficient set joins on similarity predicates
This paper presents an efficient, scalable, and general algorithm for performing set joins on predicates that involve similarity measures such as intersection size, Jaccard coefficient, cosine similarity, and edit distance, and that generalize to several weighted and unweighted measures of partial word overlap between sets.
Generalizing Across Domains via Cross-Gradient Training
Empirical evaluation on three different applications establishes that (1) domain-guided perturbation provides consistently better generalization to unseen domains than generic instance perturbation methods, and that (2) data augmentation is a more stable and accurate method than domain adversarial training.
Automatic segmentation of text into structured records
Describes DATAMOLD, a tool that learns to automatically extract structure when seeded with a small number of training examples and that enhances Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information.
On the Computation of Multidimensional Aggregates
This paper presents fast algorithms for computing a collection of group-bys using sort-based and hash-based grouping methods with several optimizations, like combining common operations across multiple group-bys, caching, and using pre-computed group-bys for computing other group-bys.
Learning with Graphical Models
Graphical models provide a powerful framework for probabilistic modelling and reasoning. Although the theory behind learning and inference is well understood, most practical applications require…