New Properties of the Data Distillation Method When Working With Tabular Data

  title={New Properties of the Data Distillation Method When Working With Tabular Data},
  author={Dmitry Medvedev and Alexander D'yakonov},
Data distillation is the problem of reducing the volume oftraining data while keeping only the necessary information. With thispaper, we deeper explore the new data distillation algorithm, previouslydesigned for image data. Our experiments with tabular data show thatthe model trained on distilled samples can outperform the model trainedon the original dataset. One of the problems of the considered algorithmis that produced data has poor generalization on models with differenthyperparameters. We… 
2 Citations
Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation
New data distillation techniques based on generative teaching networks, gradient matching, and the Implicit Function Theorem are suggested that are computationally more efficient than previous ones and allow to increase the performance of models trained on distilled data.
Deep Neural Networks and Tabular Data: A Survey
An in-depth overview of state-of-the-art deep learning methods for tabular data, categorizing these methods into three groups: data transformations, specialized architectures, and regularization models, and indicates that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning model development is stagnating.


Dataset Distillation
This paper keeps the model fixed and instead attempts to distill the knowledge from a large training dataset into a small one, to synthesize a small number of data points that do not need to come from the correct data distribution but will approximate the model trained on the original data.
Soft-Label Dataset Distillation and Text Dataset Distillation
This work proposes to simultaneously distill both images and their labels, thus assigning each synthetic sample a `soft' label (a distribution of labels) and demonstrates that text distillation outperforms other methods across multiple datasets.
Distilling the Knowledge in a Neural Network
This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Generic Methods for Optimization-Based Modeling
Experimental results on denoising and image labeling problems show that learning with truncated optimization greatly reduces computational expense compared to “full” fitting.
Deep Networks with Stochastic Depth
Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time and reduces training time substantially and improves the test error significantly on almost all data sets that were used for evaluation.
Gradient-based Hyperparameter Optimization through Reversible Learning
This work computes exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure, which allows us to optimize thousands ofhyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
Gradient-Based Optimization of Hyperparameters
This article presents a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameter gradient involving second derivatives of the training criterion.
On the limited memory BFGS method for large scale optimization
The numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence, and the convergence properties are studied to prove global convergence on uniformly convex problems.
Backpropagation Applied to Handwritten Zip Code Recognition
This paper demonstrates how constraints from the task domain can be integrated into a backpropagation network through the architecture of the network, successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service.
Automatic Differentiation of Algorithms for Machine Learning
The technique of automatic differentiation is reviewed, its two main modes are described, how it can benefit machine learning practitioners is explained, and how to reach the widest possible audience is explained.