An algorithm is presented for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks, and it is shown how the learned metric can be used to obtain a compact, low-dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance.
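Once such a metric has been learned, both the distance and the low-dimensional embedding are simple to compute. A minimal NumPy sketch (the matrix M below is an arbitrary positive semi-definite example, not a learned metric):

```python
import numpy as np

# Hypothetical factor L; a learned Mahalanobis metric M = L^T L is PSD.
L = np.array([[2.0, 0.0],
              [0.5, 1.0]])
M = L.T @ L

def mahalanobis(x, y, M):
    """Quadratic (Mahalanobis) distance d_M(x, y) = sqrt((x-y)^T M (x-y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])

# Mapping inputs through L gives a Euclidean embedding in which ordinary
# distance equals the Mahalanobis distance under M; with a rectangular L
# this is the compact feature representation the summary refers to.
assert np.isclose(mahalanobis(x, y, M), np.linalg.norm(L @ x - L @ y))
```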

A novel message passing algorithm for approximating the MAP problem in graphical models that is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights.

This work proposes to solve the cluster selection problem monotonically in the dual LP, iteratively selecting clusters with guaranteed improvement, and quickly re-solving with the added clusters by reusing the existing solution.

A formal definition of the general continuous IB problem is given and an analytic solution for the optimal representation for the important case of multivariate Gaussian variables is obtained, in terms of the eigenvalue spectrum.
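A rough numerical sketch of the spectrum in question (all covariance values below are made up for illustration): for jointly Gaussian X and Y, the analytic Gaussian IB solution is expressed through the eigen-decomposition of the normalized conditional covariance Sigma_{x|y} Sigma_x^{-1}, whose eigenvalues always lie in [0, 1]:

```python
import numpy as np

# Hypothetical joint Gaussian over (X, Y): covariance of X, cross-covariance
# of X and Y, and covariance of Y (illustrative values only).
Sigma_x  = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_xy = np.array([[0.8], [0.3]])
Sigma_y  = np.array([[1.0]])

# Conditional covariance: Sigma_{x|y} = Sigma_x - Sigma_xy Sigma_y^{-1} Sigma_yx
Sigma_x_given_y = Sigma_x - Sigma_xy @ np.linalg.inv(Sigma_y) @ Sigma_xy.T

# The relevant spectrum: eigenvalues of Sigma_{x|y} Sigma_x^{-1}.  Smaller
# eigenvalues correspond to directions of X that are more informative
# about Y, so a compressed representation keeps them first.
M = Sigma_x_given_y @ np.linalg.inv(Sigma_x)
eigvals = np.linalg.eigvals(M)
order = np.argsort(eigvals.real)   # most informative directions first
```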

A new algorithm for avoiding single feature over-weighting is introduced by analyzing robustness using a game theoretic formalization, and classifiers which are optimally resilient to deletion of features in a minimax sense are developed.

This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics, and shows that it consistently and significantly outperforms standard methods of statistical correspondence modeling.

It is shown how partition functions and marginals for directed spanning trees can be computed through an adaptation of Kirchhoff’s Matrix-Tree Theorem, and the resulting algorithm is used to train both log-linear and max-margin dependency parsers.
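The directed (Tutte) variant of the Matrix-Tree Theorem collapses the sum over exponentially many directed spanning trees into a single determinant. A toy sketch on a hypothetical 3-node weighted graph:

```python
import numpy as np

# Toy weighted digraph; w[i, j] is the weight of edge i -> j
# (hypothetical weights, for illustration only).
w = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])

# Directed Matrix-Tree (Tutte) Laplacian: off-diagonal entries are
# -w[i, j]; each diagonal entry holds the node's total incoming weight.
L = -w.copy()
np.fill_diagonal(L, w.sum(axis=0))

# Partition function over spanning trees (arborescences) rooted at node 0:
# delete row and column 0 and take the determinant of what remains.
Z = np.linalg.det(L[1:, 1:])
# Brute-force check: the three valid rooted trees have total weight
# 2*1 + 2*3 + 1*1 = 9.
```

Marginals for individual edges follow by differentiating log Z with respect to the edge weights, which is what makes the theorem usable inside log-linear parser training.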

This work provides the first global-optimality guarantee for gradient descent on a convolutional neural network with ReLU activations, showing that learning is NP-complete for a general input distribution but that, when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

This work proposes to solve the combinatorial problem of finding the highest-scoring Bayesian network structure from data by maintaining an outer-bound approximation to the polytope and iteratively tightening it by searching over a new class of valid constraints.