Recent Developments in Document Clustering
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
TLDR
It is concluded that BERT can be pruned once during pre-training rather than separately for each task without affecting performance, and that fine-tuning BERT on a specific task does not improve its prunability.
Learning Invariant Representations of Social Media Users
TLDR
A novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features is proposed.
Name Phylogeny: A Generative Model of String Variation
TLDR
The variational EM learning algorithm can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
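For reference, the untrained Levenshtein baseline named above can be sketched in a few lines. This is a minimal illustration of the standard edit distance, not the paper's learned model. The function name `levenshtein` and the example strings are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


# Hypothetical name pair: a nickname and its full form share few characters,
# so a fixed edit distance treats them as far apart.
print(levenshtein("Bill", "William"))
```

Every edit costs the same under this measure, regardless of how names actually vary, which is one reason a trained model of string variation can do better than it.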
Entity Clustering Across Languages
TLDR
The approach extends standard clustering algorithms with cross-lingual mention and context similarity measures, and does not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown.
Predicting Twitter User Demographics from Names Alone
TLDR
Character-level neural models that learn a representation of a user’s name and screen name to predict gender and ethnicity are explored, allowing for demographic inference with minimal data.
Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
TLDR
A new Twitter corpus with entity annotations for entity clusters, supporting cross-document coreference resolution (CDCR), is introduced. It draws on Twitter data surrounding the 2013 Grammy music awards ceremony and provides a large set of annotated tweets focusing on a single event.
PARMA: A Predicate Argument Aligner
We introduce PARMA, a system for cross-document, semantic predicate and argument alignment. Our system combines a number of linguistic resources familiar to researchers in areas such as recognizing
Robust Entity Clustering via Phylogenetic Inference
TLDR
A model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data is proposed, and a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets are presented.
Clustering for Data Reduction: A Divide and Conquer Approach
TLDR
This work first roughly divides the data into balanced clusters using bisecting k-means and spectral cuts, and then finds the prototypes for each cluster by affinity propagation, which performs an order of magnitude faster than simply looking for prototypes on the entire dataset.
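The two-stage idea in this summary (split first, then pick prototypes inside each cluster) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It uses plain bisecting 2-means without spectral cuts or explicit balancing, and it substitutes a medoid per cluster for affinity propagation. All function names and the toy data are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_means(points, rng, iters=20):
    """Split points into two groups with Lloyd's algorithm (k = 2)."""
    centers = rng.sample(points, 2)
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break
        centers = [mean(left), mean(right)]
    return left, right


def bisecting_kmeans(points, k, seed=0):
    """Divide step: repeatedly split the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:  # nothing left to split
            clusters.append(largest)
            break
        left, right = two_means(largest, rng)
        if not left or not right:  # degenerate split, e.g. duplicate points
            half = len(largest) // 2
            left, right = largest[:half], largest[half:]
        clusters += [left, right]
    return clusters


def medoid(cluster):
    """Conquer step (stand-in for affinity propagation): the member with
    the smallest total squared distance to the rest of its cluster."""
    return min(cluster, key=lambda c: sum(sq_dist(c, p) for p in cluster))


# Toy data: two small groups of 2-D points (illustrative only).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = bisecting_kmeans(data, k=2)
prototypes = [medoid(c) for c in clusters]
print(prototypes)
```

The prototype search compares every pair of points within a cluster, so running it on several small clusters costs far less than running it once on the whole dataset. That is the intuition behind the speedup the summary reports.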