Corpus ID: 242757575

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola
We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy…


Multimodal AutoML for Image, Text and Tabular Data

This tutorial demonstrates fundamental techniques that power multimodal AutoML, focusing on automatically building and training deep learning models, which are powerful yet cumbersome to manage manually.

A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features

This work proposes a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data. The models are ensembled and stacked in multiple layers, leveraging bagging and stacking strategies to achieve strong generalization while effectively mitigating label leakage and overfitting.
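The label-leakage mitigation mentioned above typically rests on out-of-fold stacking: each base model's prediction for a row comes from a model copy that never trained on that row, so the meta-learner cannot exploit memorized labels. A minimal stdlib sketch of the idea (the toy mean-predictor "model" and the modulo fold assignment are illustrative assumptions, not the paper's implementation):

```python
def fit_mean(xs, ys):
    """Toy 'model': predicts the training-label mean (stand-in for any learner)."""
    m = sum(ys) / len(ys)
    return lambda x: m

def oof_predictions(xs, ys, fit, k=5):
    """K-fold out-of-fold predictions for one base model.

    Row i is predicted by a model trained only on rows outside i's fold,
    so its own label never leaks into the feature fed to the next layer.
    """
    n = len(xs)
    folds = [i % k for i in range(n)]  # simple round-robin fold assignment
    preds = [0.0] * n
    for fold in range(k):
        train_idx = [i for i in range(n) if folds[i] != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(n):
            if folds[i] == fold:
                preds[i] = model(xs[i])
    return preds

xs = list(range(10))
ys = [2.0 * x for x in xs]
oof = oof_predictions(xs, ys, fit_mean)
# Each OOF prediction is the mean of labels from the other folds only.
```

The OOF predictions then serve as input features for the next stacking layer; bagging adds a second layer of variance reduction on top.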

FDB: Fraud Dataset Benchmark

FDB is introduced, a compilation of publicly available datasets catered to fraud detection, comprising a variety of fraud-related tasks ranging from identifying fraudulent card-not-present transactions, detecting bot attacks, classifying malicious URLs, and predicting loan risk to content moderation.

Fraud Dataset Benchmark for Automated Machine Learning Pipelines

Using FDB tools, three AutoML pipelines, including AutoGluon, H2O, and Amazon Fraud Detector, are evaluated across nine different fraud detection datasets, and the results are discussed.

AMLB: an AutoML Benchmark

An open and extensible benchmark is introduced that follows best practices and avoids common mistakes when comparing AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation.

Towards Automated Distillation: A Systematic Study of Knowledge Distillation in Natural Language Processing

This work proposes Distiller, a meta-KD framework that systematically combines the key distillation techniques as components across different stages of the KD pipeline, and proposes a simple AutoDistiller algorithm that can recommend a close-to-optimal KD pipeline for a new dataset/task.

Classifying Characteristics of Opioid Use Disorder From Hospital Discharge Summaries Using Natural Language Processing

This pilot study demonstrated that rich information regarding problematic opioid use can be manually identified by annotators, but more training samples and features would improve the ability to reliably identify less common classes from clinical text, including text from outpatient settings.

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data

TaBERT is a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables that achieves new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a…

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

The TabTransformer is a novel deep tabular data modeling architecture for supervised and semi-supervised learning built upon self-attention based Transformers that outperforms the state-of-the-art deep learning methods for tabular data by at least 1.0% on mean AUC, and matches the performance of tree-based ensemble models.

Multimodal deep networks for text and image-based document classification

A multimodal neural network is designed to learn from word embeddings, computed on text extracted by OCR, and from the image itself; it boosts pure-image accuracy by 3% on Tobacco3482 and on RVL-CDIP augmented by the new QS-OCR text dataset, even without clean text information.

TURL: Table Understanding through Representation Learning

This paper proposes a structure-aware Transformer encoder to model the row-column structure of relational tables, and presents a new Masked Entity Recovery objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data.

Universal Language Model Fine-tuning for Text Classification

This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set

This work repeats multiple experiments from recent work on neural models for low-resource NLP and compares results for models obtained by training with and without development sets, highlighting the importance of realistic experimental setups in the publication of low-resource research results.

Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

This work proposes FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks, and produces significantly better individual models than one obtains through standard training on the original data.
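The core distillation idea above, that a teacher labels extra inputs so a simple student can fit the teacher's behavior on far more points than the original labels provide, can be sketched in a few lines. This is a stdlib toy under stated assumptions, not FAST-DAD itself: the "teacher" ensemble and the one-parameter linear student are illustrative stand-ins.

```python
import random

random.seed(0)

def teacher(x):
    """Stand-in for a complex ensemble: the average of three base predictors."""
    return (3.0 * x + (3.0 * x + 1.0) + (3.0 * x - 1.0)) / 3.0

# Augmented distillation data: inputs are drawn freely and labeled by the
# teacher, so the student sees many (x, teacher(x)) pairs regardless of how
# many ground-truth labels originally existed.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [teacher(x) for x in xs]

# Student: a single linear model y = w * x, fit by closed-form least squares.
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
# The student closely recovers the teacher's behavior (w near 3.0).
```

In the paper's setting the student would be a boosted tree, random forest, or small network rather than a linear fit, but the training signal is the same: the teacher's predictions on augmented inputs.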