Stochastic gradient descent is popular for large-scale optimization but has slow asymptotic convergence due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent, which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove …
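A minimal sketch of the SVRG update rule described above, written for generic component gradients; the function name grad_i, the step size eta, and the inner-loop length m = 2n are illustrative choices of mine, and the convergence conditions proved in the paper are not reproduced here.

```python
import numpy as np

def svrg(grad_i, n, w0, eta=0.1, epochs=20, m=None, rng=None):
    """Sketch of stochastic variance reduced gradient (SVRG).

    grad_i(w, i) -- gradient of the i-th component function at w
    n            -- number of component functions
    m            -- inner-loop length (2n is a common illustrative choice)
    """
    rng = np.random.default_rng() if rng is None else rng
    m = 2 * n if m is None else m
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        # Full gradient at the snapshot point: the variance-reduction anchor.
        mu = sum(grad_i(w_tilde, i) for i in range(n)) / n
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(w, i) - grad_i(w_tilde, i) + mu
            w = w - eta * g
        w_tilde = w  # update the snapshot (last-iterate option)
    return w_tilde
```

The correction term grad_i(w_tilde, i) - mu has zero mean, so the update is still an unbiased gradient estimate, but its variance shrinks as the iterates approach the snapshot point.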
We study how closely the optimal Bayes error rate can be approached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measure of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be …
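As a concrete instance of the convex upper bounds discussed above (the hinge loss is an illustrative choice of mine, not necessarily the loss analyzed in the paper), the 0-1 error of a real-valued classifier $f$ is dominated pointwise by a convex surrogate:

```latex
% For a label y in {-1,+1} and score f(x), the 0-1 loss is bounded by the hinge loss:
\mathbf{1}\bigl[y f(x) \le 0\bigr] \;\le\; \max\bigl(0,\, 1 - y f(x)\bigr),
\qquad\text{hence}\qquad
\Pr\bigl[y f(x) \le 0\bigr] \;\le\; \mathbb{E}\,\max\bigl(0,\, 1 - y f(x)\bigr).
```

Minimizing the right-hand side over a convex function class is a convex problem, and the question studied above is how close the error of its minimizer can come to the Bayes error rate.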
Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. Typically, they are learned using search heuristics (such as the Baum-Welch / EM algorithm), which suffer from the usual local optima issues. While in general these models are known to be hard to learn with samples from the …
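For background on the quantity that Baum-Welch/EM locally maximizes, here is a minimal sketch of exact likelihood evaluation in a discrete HMM via the forward algorithm; the parameter names pi, A, B are my own notation, not the paper's.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """Forward algorithm for a discrete HMM.

    obs -- sequence of observed symbol indices
    pi  -- (K,) initial state distribution
    A   -- (K, K) transition matrix, A[i, j] = P(next state j | state i)
    B   -- (K, V) emission matrix,  B[i, v] = P(symbol v | state i)
    Returns log P(obs) under the model.
    """
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(1, len(obs)):
        # Normalize to avoid underflow, accumulating the log scale factor.
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_p + np.log(alpha.sum())
```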
One of the most important issues in machine learning is whether one can improve the performance of a supervised learning algorithm by including unlabeled data. Methods that use both labeled and unlabeled data are generally referred to as semi-supervised learning. Although a number of such methods have been proposed, at the current stage we still do not have a …
Stochastic Gradient Descent (SGD) has become popular for solving large-scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked a good convergence analysis. This paper …
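A minimal sketch of stochastic dual coordinate ascent for the L2-regularized hinge-loss SVM; the closed-form coordinate update below is the standard one for that loss, while the epoch count, sampling scheme, and stopping rule are illustrative assumptions rather than anything specified above.

```python
import numpy as np

def sdca_svm(X, y, lam=0.01, epochs=10, rng=None):
    """Sketch of stochastic dual coordinate ascent (SDCA) for
        min_w  (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i <w, x_i>).

    X -- (n, d) feature matrix; y -- (n,) labels in {-1, +1}.
    Maintains dual variables beta in [0, 1]^n with
    w = (1 / (lam * n)) * sum_i beta_i * y_i * x_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    beta = np.zeros(n)
    w = np.zeros(d)
    # Squared row norms, with a small floor to guard against all-zero rows.
    sq_norms = np.maximum((X ** 2).sum(axis=1), 1e-12)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Closed-form maximization of the dual objective over coordinate i.
            margin = 1.0 - y[i] * (X[i] @ w)
            new_beta = np.clip(beta[i] + lam * n * margin / sq_norms[i], 0.0, 1.0)
            w += (new_beta - beta[i]) * y[i] * X[i] / (lam * n)
            beta[i] = new_beta
    return w
```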
We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem:
• Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of a theoretical guarantee that the local minimum gives a good solution.
• …
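To make the first (heuristic) approach above concrete, here is a tiny gradient-descent run on a non-convex one-dimensional objective of my own choosing; different starting points land in different local minima, which is exactly the drawback noted above.

```python
def gradient_descent(grad, x0, eta=0.05, steps=500):
    """Plain gradient descent; converges only to a nearby stationary point."""
    x = float(x0)
    for _ in range(steps):
        x -= eta * grad(x)
    return x

# Non-convex objective f(x) = x^4 - 3x^2 + x, which has two local minima.
f = lambda x: x**4 - 3 * x**2 + x
grad_f = lambda x: 4 * x**3 - 6 * x + 1

for x0 in (-2.0, 2.0):
    x_star = gradient_descent(grad_f, x0)
    print(f"start {x0:+.1f} -> local minimum near {x_star:+.3f}, f = {f(x_star):.3f}")
```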
We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, known as regularized empirical risk minimization. …
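A minimal sketch of the composite (proximal) step this problem structure calls for, assuming an L1 regularizer as the "simple" convex term; this illustrates the structure only and is not the paper's specific algorithm.

```python
import numpy as np

def prox_l1(v, eta, lam):
    """Proximal mapping of eta * lam * ||.||_1 (soft-thresholding), a typical
    'simple' regularizer in regularized empirical risk minimization."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def proximal_step(w, g, eta, lam):
    """One composite step: a gradient step on the smooth average term,
    followed by the proximal mapping of the simple convex term."""
    return prox_l1(w - eta * g, eta, lam)
```

In a variance-reduced method for this setting, g would be a variance-reduced stochastic gradient of the smooth part (as in the SVRG sketch further above) rather than its full gradient.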
A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVMs) have been applied to text categorization problems. These methods are similar in that they all find hyperplanes that approximately separate a class of document vectors from its complement. However, support vector …
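To make the shared hyperplane structure concrete, a toy snippet of the common decision rule only (it does not implement LLSF, logistic regression, or SVM training; the weights below are assumed, not learned):

```python
import numpy as np

def predict(X, w, b):
    """Linear classification: the decision boundary {x : <w, x> + b = 0}
    is a hyperplane separating a class from its complement."""
    return np.where(X @ w + b >= 0.0, 1, -1)

# Toy bag-of-words document vectors (term counts) and an assumed weight vector.
X = np.array([[2, 0, 1],
              [0, 3, 0]], dtype=float)
w = np.array([1.0, -0.5, 0.2])   # illustrative weights, e.g. from LLSF/logistic/SVM training
b = -0.5
print(predict(X, w, b))          # -> [ 1 -1 ]
```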
This paper presents a classifier-combination experimental framework for named entity recognition in which four diverse classifiers (a robust linear classifier, maximum entropy, transformation-based learning, and a hidden Markov model) are combined under different conditions. When no gazetteer or other additional training resources are used, the combined …
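As one illustration of combining the outputs of several NER classifiers, a plain per-token majority vote; this particular rule and the tie-breaking choice are assumptions of mine, not necessarily one of the combination conditions studied above.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token label predictions from several classifiers.

    predictions -- list of label sequences, one per classifier,
                   e.g. [["B-PER", "O"], ["B-PER", "B-LOC"], ["O", "O"]]
    Returns the per-token majority label (ties go to the earliest-listed classifier).
    """
    combined = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        top = max(counts.values())
        # Tie-break by the order in which classifiers are listed.
        winner = next(lbl for lbl in token_labels if counts[lbl] == top)
        combined.append(winner)
    return combined

print(majority_vote([["B-PER", "O"], ["B-PER", "B-LOC"], ["O", "O"]]))
# -> ['B-PER', 'O']
```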