• Corpus ID: 219635787

Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

Like Hui, Mikhail Belkin
Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in… 
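To make the comparison in the abstract concrete, the two losses can be sketched for a single example as follows. This is a minimal illustration, not the authors' training code: the function names, the averaging over classes, and the choice to apply the square loss to raw network outputs against a one-hot target are assumptions of this sketch.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, label):
    """Negative log-probability of the true class under softmax."""
    return -math.log(softmax(logits)[label])


def square_loss(logits, label):
    """Mean squared error between raw outputs and the one-hot target.

    Assumption of this sketch: no softmax is applied before the square loss.
    """
    target = [1.0 if k == label else 0.0 for k in range(len(logits))]
    return sum((z - t) ** 2 for z, t in zip(logits, target)) / len(logits)


# Hypothetical three-class example whose true class is index 0.
logits = [2.0, 0.5, -1.0]
label = 0
print("cross-entropy:", cross_entropy_loss(logits, label))
print("square loss:  ", square_loss(logits, label))
```

Both functions take the same raw outputs, so in a sketch like this only the loss function changes between the two training setups; the prediction rule (the arg-max over outputs) stays the same.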
Rank4Class: A Ranking Formulation for Multiclass Classification
This paper argues that ranking metrics, such as Normalized Discounted Cumulative Gain (NDCG), can be more informative than existing Top-K metrics and demonstrates that the dominant neural MCC architecture can be formulated as a neural ranking framework with a specific set of design choices.
On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features
This work provides the first global landscape analysis for vanilla nonconvex MSE loss and shows that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions.
Understanding Square Loss in Training Overparametrized Neural Network Classifiers
This work systematically investigates how square loss performs for overparametrized neural networks in the neural tangent kernel (NTK) regime, demonstrating the effectiveness of square loss in both synthetic low-dimensional data and real image data.
Generalization for multiclass classification with overparameterized linear models
It turns out that the key difference from the binary classification setting is that there are relatively fewer positive training examples of each class in the multiclass setting as the number of classes increases, making the multiclass problem “harder” than the binary one.
Deep Classifiers trained with the Square Loss
It is shown that convergence to a solution with the absolute minimum ρ is expected when normalization by a Lagrange multiplier is used together with Weight Decay, and it is proved that SGD converges to solutions that have a bias towards 1) large margin (i.e. small ρ) and 2) low rank of the weight matrices.
Soft Calibration Objectives for Neural Networks
Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross-entropy loss and post-hoc recalibration methods.
Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss
It is proved that quasi-interpolating solutions obtained by gradient descent in the presence of WD are expected to show the recently discovered behavior of Neural Collapse, and other predictions of the theory are described.
LQF: Linear Quadratic Fine-Tuning
This work presents the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning on most of the real-world image classification tasks tested, thus enjoying the interpretability of linear models without incurring punishing losses in performance.
Introducing One Sided Margin Loss for Solving Classification Problems in Deep Networks
Using OSM loss leads to faster training and better accuracies than binary and categorical cross-entropy in several commonly used deep models for classification and optical character recognition problems, and for large networks its accuracies are better than those of cross-entropy and hinge loss.
Standalone Neural ODEs with Sensitivity Analysis
This paper presents the Standalone Neural ODE (sNODE), a continuous-depth neural ODE model capable of describing a full deep neural network. This uses a novel nonlinear conjugate gradient (NCG)


Automatically Constructing a Corpus of Sentential Paraphrases
The creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase, is described.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
The Design for the Wall Street Journal-based CSR Corpus
This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated on scaling up MobileNets and ResNet.
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
A systematic evaluation of generic convolutional and recurrent architectures for sequence modeling concludes that the common association between sequence modeling and recurrent networks should be reconsidered, and that convolutional networks should be regarded as a natural starting point for sequence modeling tasks.
Regression Modeling Strategies with Applications to Linear Models
This paper presents a meta-modelling procedure called Cox Proportional Hazards Regression Model, which automates the very labor-intensive and therefore time-heavy and expensive process of rebuilding a linear model from scratch.
Gradient-based learning applied to document recognition
This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.
Classification vs regression in overparameterized regimes: Does the loss function matter?
This work compares classification and regression tasks in the overparameterized linear model with Gaussian features and demonstrates the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).