Specializing Word Embeddings (for Parsing) by Information Bottleneck

Xiang Lisa Li and Jason Eisner
Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag… 
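The continuous-vector case described in the abstract can be illustrated with a generic variational information bottleneck loss. The following is a minimal sketch and not the authors' implementation: it assumes a diagonal Gaussian encoder, a standard normal prior, and a hypothetical trade-off weight `beta`, and it treats the parser's negative log-likelihood (`task_nll`) as a number supplied from elsewhere. All function names are illustrative.

```python
import math
import random


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: the prior over compressed vectors is a standard normal.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def sample_compressed(mean, log_var, rng):
    """Reparameterized sample t = mean + std * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def vib_loss(task_nll, mean, log_var, beta):
    """Task loss plus beta times the compression penalty (hypothetical weighting)."""
    return task_nll + beta * gaussian_kl_to_standard_normal(mean, log_var)


# Toy usage: a 2-dimensional compressed representation for one word.
rng = random.Random(0)
mean, log_var = [0.5, -1.0], [0.0, -2.0]
t = sample_compressed(mean, log_var, rng)
loss = vib_loss(task_nll=1.2, mean=mean, log_var=log_var, beta=0.1)
```

A larger `beta` pushes the encoder toward the prior, so less of the original embedding survives compression; a smaller `beta` keeps more information at the cost of a less compact representation.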


Specializing Word Embeddings (for Parsing) by Information Bottleneck (Extended Abstract)

A very fast variational information bottleneck (VIB) method to nonlinearly compress word embeddings, keeping only the information that helps a discriminative parser.

SemGloVe: Semantic Co-Occurrences for GloVe From BERT

SemGloVe is proposed, which distills semantic co-occurrences from BERT into static GloVe word embeddings and can define the co-occurrence weights by directly considering the semantic distance between word pairs.

Learned Incremental Representations for Parsing

We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence, where the label is predicted using strictly incremental processing of

Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

A neural conditional random field autoencoder (CRF-AE) model for unsupervised POS tagging, inspired by feature-rich HMM, which outperforms previous state-of-the-art models on Penn Treebank and multilingual Universal Dependencies treebank v2.0.

Deep Clustering of Text Representations for Supervision-Free Probing of Syntax

This work explores deep clustering of multilingual text representations for unsupervised model interpretation and induction of syntax and finds that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English; possibly even as much as English BERT (E-BERT).

Variational Information Bottleneck for Effective Low-Resource Fine-Tuning

This work proposes to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks, and shows that the method successfully reduces overfitting.

Finding Universal Grammatical Relations in Multilingual BERT

An unsupervised analysis method is presented that provides evidence mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy, suggesting that even without explicit supervision, multilingual masked language models learn certain linguistic universals.

An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction

This paper shows that the trade-off between rationale conciseness and task performance can be managed better by optimizing a bound on the Information Bottleneck (IB) objective, and derives a learning objective that allows direct control of mask sparsity levels through a tunable sparse prior.

Building Interpretable Interaction Trees for Deep NLP Models

This paper proposes a method to disentangle and quantify interactions among words that are encoded inside a DNN for natural language processing. We construct a tree to encode salient interactions



Dissecting Contextual Word Embeddings: Architecture and Representation

There is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation

This paper describes the system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies, which was ranked first according to LAS and outperformed the other systems by a large margin.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

A Structural Probe for Finding Syntax in Word Representations

A structural probe is proposed, which evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space, and shows that such transformations exist for both ELMo and BERT but not in baselines, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.

What do you learn from context? Probing for sentence structure in contextualized word representations

A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline are constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Deep Biaffine Attention for Neural Dependency Parsing

This paper uses a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels, and shows which hyperparameter choices had a significant effect on parsing accuracy, allowing it to achieve large gains over other graph-based approaches.

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

AllenNLP: A Deep Semantic Natural Language Processing Platform

AllenNLP is described, a library for applying deep learning methods to NLP research that addresses issues with easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions.

Accurate Unlexicalized Parsing

We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence