Corpus ID: 2759724

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

@inproceedings{Tarvainen2017MeanTA,
  title={Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results},
  author={Antti Tarvainen and Harri Valpola},
  booktitle={NIPS},
  year={2017}
}
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example and penalizes predictions that are inconsistent with this target. Because the targets change only once per epoch, however, Temporal Ensembling becomes unwieldy when learning large datasets. Key Method: To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance…
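To make the key method concrete, here is a minimal sketch of the weight-averaged (exponential moving average) teacher update and the consistency cost, assuming a PyTorch-style student/teacher pair. This is illustrative code, not the authors' released implementation; the names student, teacher, alpha, and the MSE-on-softmax choice of consistency cost are assumptions based on the description above.

# Illustrative Mean Teacher sketch (not the authors' code). The teacher's weights
# are an exponential moving average (EMA) of the student's weights; a consistency
# cost pushes student predictions toward teacher predictions on the same inputs.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    # teacher <- alpha * teacher + (1 - alpha) * student, parameter-wise
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def consistency_loss(student_logits, teacher_logits):
    # Mean squared error between softmax outputs (one common choice of cost).
    return F.mse_loss(F.softmax(student_logits, dim=1),
                      F.softmax(teacher_logits, dim=1).detach())

# One assumed training step with labeled (x_l, y) and unlabeled (x_u) batches:
#   loss = F.cross_entropy(student(x_l), y) \
#          + weight * consistency_loss(student(x_u), teacher(x_u))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   update_teacher(student, teacher)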

Citations

Unsupervised Data Augmentation for Consistency Training
TLDR
A new perspective on how to effectively add noise to unlabeled examples is presented, and it is argued that the quality of the noising, specifically that produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
Smooth Neighbors on Teacher Graphs for Semi-Supervised Learning
TLDR
A novel method, called Smooth Neighbors on Teacher Graphs (SNTG), which serves as a similarity measure with respect to which the representations of "similar" neighboring points are learned to be smooth on the low-dimensional manifold and achieves state-of-the-art results on semi-supervised learning benchmarks.
Unsupervised Domain Adaptation using Generative Models and Self-ensembling
TLDR
The results suggest that self-ensembling is better than simple data augmentation with the newly generated data, and that a single model trained this way can achieve the best performance across all the different transfer tasks.
SELF: Learning to Filter Noisy Labels with Self-Ensembling
TLDR
This work presents a simple and effective method self-ensemble label filtering (SELF) to progressively filter out the wrong labels during training that substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures.
Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones
TLDR
This paper proposes to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models, by only driving the prediction of the student model to be consistent with that of the teacher model, and finds that such simple distillation settings are extremely effective.
AdaReNet: Adaptive Reweighted Semi-supervised Active Learning to Accelerate Label Acquisition
TLDR
This work takes a holistic approach to label acquisition, considering the expansion of the clean and pseudo-labeled subsets jointly, and introduces a collaborative teacher-student framework in which the teacher learns a data-driven curriculum.
Improving Consistency-Based Semi-Supervised Learning with Weight Averaging
TLDR
It is shown that consistency regularization leads to flatter but narrower optima for semi-supervised models, and that with fast-SWA the simple $\Pi$ model becomes state-of-the-art for large labeled settings; a brief weight-averaging sketch follows this list.
AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation
TLDR
AdaMatch, a unified solution for unsupervised domain adaptation, is introduced and it is found that AdaMatch either matches or significantly exceeds the state-of-the-art in each case using the same hyper-parameters regardless of the dataset or task.
Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation
TLDR
This paper addresses the prediction accuracy problem of consistency learning methods with novel extensions of the mean-teacher model, which include a new auxiliary teacher, and the replacement of MT’s mean square error (MSE) by a stricter confidence-weighted cross-entropy (Conf-CE) loss.
When Semi-Supervised Learning Meets Transfer Learning: Training Strategies, Models and Datasets
TLDR
This study comprehensively examines how SSL methods starting from pre-trained models perform under varying conditions, including training strategies, architecture choice, and datasets, and demonstrates that the gains from SSL techniques over a fully-supervised baseline are smaller when training from a pre-trained model than when training from random initialization.
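As noted under the weight-averaging entry above, here is a minimal sketch of SWA-style averaging of model checkpoints; the fast-SWA variant differs mainly in averaging more often within each learning-rate cycle. This is illustrative PyTorch, not the cited paper's implementation, and the names model, avg_model, and n_averaged are assumptions.

# Illustrative weight averaging over training checkpoints (SWA-style).
import copy
import torch

@torch.no_grad()
def update_average(avg_model, model, n_averaged):
    # Equal-weight running mean of parameters over the checkpoints seen so far:
    # avg <- avg + (param - avg) / (n_averaged + 1)
    for a, p in zip(avg_model.parameters(), model.parameters()):
        a.add_((p - a) / (n_averaged + 1))
    return n_averaged + 1

# Assumed usage inside an ordinary consistency-regularized training loop:
#   avg_model = copy.deepcopy(model)   # start the average from the current weights
#   n = 1
#   for epoch in range(num_epochs):
#       train_one_epoch(model, ...)
#       n = update_average(avg_model, model, n)

At evaluation time the averaged weights are used in place of the final SGD iterate; batch-normalization statistics are typically recomputed for the averaged model.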

References

Showing 1-10 of 48 references
Temporal Ensembling for Semi-Supervised Learning
TLDR
Self-ensembling is introduced, in which an ensemble prediction is formed from the network's outputs at different training epochs; it is shown that this ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch and can thus be used as a target for training.
Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning
TLDR
An unsupervised loss function is proposed that takes advantage of the stochastic nature of these methods and minimizes the difference between the predictions of multiple passes of a training sample through the network.
Swapout: Learning an ensemble of deep architectures
TLDR
This work describes Swapout, a new stochastic training method that outperforms ResNets of identical network structure, yielding impressive results on CIFAR-10 and CIFAR-100, and proposes a parameterization that reveals connections to existing architectures and suggests a much richer set of architectures to be explored.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Shake-Shake regularization
The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination.
Deep Networks with Stochastic Depth
TLDR
Stochastic depth is proposed, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time and reduces training time substantially and improves the test error significantly on almost all data sets that were used for evaluation.
Aggregated Residual Transformations for Deep Neural Networks
TLDR
On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Variational Autoencoder for Deep Learning of Images, Labels and Captions
TLDR
A novel variational autoencoder is developed to model images, as well as associated labels or captions, and a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.
On Calibration of Modern Neural Networks
TLDR
It is discovered that modern neural networks, unlike those from a decade ago, are poorly calibrated, and that on most datasets temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions; a short temperature-scaling sketch follows below.
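Regarding the calibration entry above, here is a minimal sketch of temperature scaling, assuming held-out validation logits and labels are available as tensors. This is illustrative code, not the cited paper's implementation; fit_temperature, val_logits, and val_labels are assumed names.

# Illustrative temperature scaling: fit a single scalar T > 0 on validation data
# by minimizing the negative log-likelihood of softmax(logits / T).
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    logits = logits.detach()                    # logits are fixed; only T is learned
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Assumed usage: T = fit_temperature(val_logits, val_labels)
#                calibrated = F.softmax(test_logits / T, dim=1)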