MATE: Multi-view Attention for Table Transformer Efficiency

  • Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller and William W. Cohen
This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables… 
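The row/column sparsity that a table-structured attention pattern exploits can be illustrated with a small mask builder (a minimal NumPy sketch; `mate_head_mask` and the toy coordinates are illustrative, not the paper's implementation): "row" heads attend only among tokens in the same table row, "column" heads only within the same column, so each head's attention is sparse rather than quadratic over the full sequence.

```python
import numpy as np

def mate_head_mask(row_ids, col_ids, head_type):
    """Boolean attention mask for one head: True = attention allowed.
    'row' heads attend within a token's table row, anything else
    attends within its column. (Illustrative names, not the paper's API.)"""
    row_ids = np.asarray(row_ids)
    col_ids = np.asarray(col_ids)
    if head_type == "row":
        return row_ids[:, None] == row_ids[None, :]
    return col_ids[:, None] == col_ids[None, :]

# Toy 2x2 table flattened in row-major order: (row, col) per token
row_ids = [0, 0, 1, 1]
col_ids = [0, 1, 0, 1]
row_mask = mate_head_mask(row_ids, col_ids, "row")
col_mask = mate_head_mask(row_ids, col_ids, "col")
```

Each mask row then contains only the handful of cells sharing that token's row or column, which is what keeps memory linear in table size.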
Iterative Hierarchical Attention for Answering Complex Questions over Long Documents
DOCHOPPER, a new model that iteratively attends to different parts of long, hierarchically structured documents to answer complex questions; it achieves state-of-the-art results on three of the datasets and is efficient at inference time, running 3–10 times faster than the baselines.
UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
  • Tianbao Xie, Chen Henry Wu, +20 authors Tao Yu
  • Computer Science
  • 2022
Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs…
DoT: An efficient Double Transformer for NLP tasks with tables
This work proposes a new architecture, DoT, a double transformer model that decomposes the problem into two sub-tasks: a shallow pruning transformer that selects the top-K tokens, followed by a deep task-specific transformer that takes those K tokens as input.
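The pruning step can be sketched as plain top-K selection over per-token relevance scores (a hedged stand-in: in DoT the scores come from a shallow transformer, whereas here `relevance` is just an array, and `prune_top_k` is an illustrative name):

```python
import numpy as np

def prune_top_k(token_embs, relevance, k):
    """Keep the k highest-scoring tokens, preserving document order.
    Stands in for DoT's shallow pruning transformer, which would
    produce `relevance` itself."""
    keep = np.sort(np.argsort(relevance)[-k:])  # indices of top-k, in order
    return token_embs[keep], keep

embs = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, dim 2
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])
pruned, keep = prune_top_k(embs, scores, k=3)
```

Only the `k` surviving embeddings are then passed to the deep task-specific transformer, so its quadratic attention cost depends on `k` rather than the original sequence length.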
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
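The low-rank idea can be sketched in a few lines (a minimal NumPy version under the assumption of single-head attention with learned projections `E` and `F` of shape k×n, as in the paper's formulation; the function name is illustrative): projecting keys and values from length n down to k makes the score matrix n×k instead of n×n.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """softmax(Q (E K)^T / sqrt(d)) (F V): keys/values are projected
    from sequence length n to a fixed k, so cost is linear in n."""
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)            # (n, k), not (n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)               # row-wise softmax
    return w @ (F @ V)                             # (n, d) output

rng = np.random.default_rng(0)
n, k, d = 8, 3, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
```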
Reformer: The Efficient Transformer
This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
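The hashing step can be illustrated with the angular LSH scheme the paper uses (a minimal sketch; `lsh_bucket` is an illustrative name, and a real implementation would use multiple hash rounds): vectors are projected onto random directions, and the bucket is the argmax over the projections and their negations, so similar queries and keys tend to land in the same bucket and attention is restricted to bucket members.

```python
import numpy as np

def lsh_bucket(vecs, n_buckets, seed=0):
    """Angular LSH: project onto n_buckets//2 random directions and
    take the argmax over [proj, -proj]. Equal vectors always share
    a bucket; nearby vectors usually do."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((vecs.shape[-1], n_buckets // 2))
    proj = vecs @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

x = np.array([[1.0, 0.0],   # two identical vectors...
              [1.0, 0.0],
              [-1.0, 0.0]]) # ...and one pointing the opposite way
buckets = lsh_bucket(x, n_buckets=4)
```

Attention then only needs to be computed within each bucket (after sorting by bucket id), which is what brings the cost down from quadratic on long sequences.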
Understanding tables with intermediate pre-training
This work adapts TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment, and creates a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.
Efficient Transformers: A Survey
This paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models, providing an organized and comprehensive overview of existing work and models across multiple domains.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applied successfully to English constituency parsing with both large and limited training data.
TaPas: Weakly Supervised Table Parsing via Pre-training
TaPas is presented, an approach to question answering over tables without generating logical forms; it outperforms or rivals semantic parsing models, improving state-of-the-art accuracy on SQA and performing on par with the state of the art on WikiSQL and WikiTQ, with a simpler model architecture.
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
TaBERT is a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables that achieves new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.
Are Transformers universal approximators of sequence-to-sequence functions?
It is established that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models.