Corpus ID: 238583495

SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

Talip Ucar, Ehsan Hajiramezanali and Lindsay Edwards
Self-supervised learning has been shown to be very effective in learning useful representations, and yet much of the success is achieved in data types such as images, audio, and text. The success is mainly enabled by taking advantage of spatial, temporal, or semantic structure in the data through augmentation. However, such structure may not exist in tabular datasets commonly used in fields such as healthcare, making it difficult to design an effective augmentation method, and hindering a… 
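The core idea named in the title — treating subsets of a table's columns as augmented views of the same rows — can be illustrated with a minimal sketch. This is a hypothetical helper, not the authors' implementation; the function name, the overlap scheme, and the parameters are assumptions for illustration only.

```python
import numpy as np

def feature_subsets(X, n_subsets=4, overlap=0.5):
    """Split the columns of X into n_subsets overlapping column groups.

    Each subset serves as one 'view' of the rows; overlap is the fraction
    of a subset's width shared with its neighbour (illustrative scheme).
    """
    n_features = X.shape[1]
    base = int(np.ceil(n_features / n_subsets))  # columns per subset before overlap
    extra = int(base * overlap)                  # columns shared with the neighbour
    subsets = []
    for i in range(n_subsets):
        start = max(0, i * base - extra)
        stop = min(n_features, (i + 1) * base)
        subsets.append(X[:, start:stop])
    return subsets

X = np.arange(20, dtype=float).reshape(2, 10)    # 2 rows, 10 features
views = feature_subsets(X, n_subsets=4, overlap=0.5)
print([v.shape for v in views])                  # → [(2, 3), (2, 4), (2, 4), (2, 2)]
```

Because every view contains complete rows (only columns differ), no spatial or temporal structure is needed — which is what makes the idea applicable to tabular data.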

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain
This paper creates a novel pretext task of estimating mask vectors from corrupted tabular data, in addition to the reconstruction pretext task, and introduces a novel tabular data augmentation method for self- and semi-supervised learning frameworks.
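The mask-corruption idea described above can be sketched as follows. This is an illustrative version under assumed details (the function name, the corruption rate, and the column-shuffling trick for sampling from each feature's marginal are not taken from the paper):

```python
import numpy as np

def mask_and_corrupt(X, p_mask=0.3, rng=None):
    """Corrupt X by resampling masked entries from each column's empirical
    distribution. Returns (corrupted X, binary mask); the mask is the
    pretext-task target alongside reconstruction."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.binomial(1, p_mask, size=X.shape)
    # Shuffle each column independently to draw replacements from its marginal.
    shuffled = np.stack([rng.permutation(X[:, j]) for j in range(X.shape[1])], axis=1)
    X_tilde = X * (1 - mask) + shuffled * mask
    return X_tilde, mask

X = np.arange(12, dtype=float).reshape(4, 3)     # 4 rows, 3 features
X_tilde, mask = mask_and_corrupt(X)
print(X_tilde.shape, mask.shape)
```

Resampling from the empirical marginals (rather than adding Gaussian noise) keeps the corrupted values plausible for each feature, which matters when columns have very different scales or types.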
Learning Discrete Representations via Information Maximizing Self-Augmented Training
In IMSAT, data augmentation imposes invariance on discrete representations by encouraging the predicted representations of augmented data points to be close to those of the original data points, trained end-to-end to maximize the information-theoretic dependency between data and their predicted discrete representations.
A Framework For Contrastive Self-Supervised Learning And Designing A New Approach
A conceptual framework that characterizes contrastive self-supervised learning (CSL) approaches along five aspects; its utility is shown by designing Yet Another DIM (YADIM), which achieves competitive results on CIFAR-10, STL-10 and ImageNet and is more robust to the choice of encoder and the representation extraction strategy.
Representation Learning with Contrastive Predictive Coding
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Context Encoders: Feature Learning by Inpainting
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Extracting and composing robust features with denoising autoencoders
This work introduces and motivates a new training principle for unsupervised learning of a representation, based on the idea of making the learned representations robust to partial corruption of the input pattern.
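That training principle — corrupt the input, then reconstruct the clean original — fits in a few lines. Below is a minimal sketch, assuming a linear autoencoder, masking noise, and plain gradient descent on toy data; none of these choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                 # toy data: 64 samples, 8 features
We = rng.normal(scale=0.1, size=(8, 4))      # encoder weights (8 -> 4)
Wd = rng.normal(scale=0.1, size=(4, 8))      # decoder weights (4 -> 8)
lr, p_corrupt = 0.1, 0.3

losses = []
for step in range(200):
    keep = rng.binomial(1, 1 - p_corrupt, size=X.shape)
    Xn = X * keep                            # masking noise: zero ~30% of entries
    H = Xn @ We                              # encode the corrupted input
    X_hat = H @ Wd                           # decode
    err = X_hat - X                          # target is the *clean* input
    losses.append((err ** 2).sum() / len(X))
    g = 2 * err / len(X)                     # dLoss/dX_hat
    gWd = H.T @ g                            # gradients before either update
    gWe = Xn.T @ (g @ Wd.T)
    We -= lr * gWe
    Wd -= lr * gWd
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The key detail is in the `err` line: the reconstruction target is the uncorrupted input, so the model cannot succeed by copying — it must learn dependencies between features, which is what makes the representations robust.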
Stacked Capsule Autoencoders
This work introduces the Stacked Capsule Autoencoder (SCAE), an unsupervised capsule autoencoder that explicitly uses geometric relationships between parts to reason about objects, and finds that object capsule presences are highly informative of the object class, leading to state-of-the-art results for unsupervised classification on SVHN and MNIST.
TabNet: Attentive Interpretable Tabular Learning
It is demonstrated that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior.
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
TaBERT is a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables that achieves new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.
Improved Baselines with Momentum Contrastive Learning
With simple modifications to MoCo, this note establishes stronger baselines that outperform SimCLR and do not require large training batches, and hopes this will make state-of-the-art unsupervised learning research more accessible.