Corpus ID: 2949558

On the effect of data set size on bias and variance in classification learning

@inproceedings{Brain1999OnTE,
  title={On the effect of data set size on bias and variance in classification learning},
  author={Damien Brain and Geoffrey I. Webb},
  year={1999}
}
With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. [...] These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.
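The abstract's key quantities, bias and variance of zero-one loss as functions of training-set size, can be sketched with a small resampling experiment. Everything below is an illustrative assumption rather than the paper's protocol: the `majority_classifier` baseline, the subsampling scheme, and the toy data are placeholders, and the estimator is a minimal Kohavi-Wolpert-style decomposition.

```python
import random
from collections import Counter

def majority_classifier(train):
    # Fit a trivial baseline: always predict the most frequent training label.
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def estimate_bias_variance(data, train_size, n_trials=50,
                           fit=majority_classifier, seed=0):
    # Kohavi-Wolpert-style estimate for zero-one loss: train on repeated
    # random subsamples, collect each test point's predictions, then read
    # bias off the "main" (most frequent) prediction and variance off the
    # disagreement with that main prediction.
    rng = random.Random(seed)
    votes = [Counter() for _ in data]          # predictions per test point
    for _ in range(n_trials):
        train = rng.sample(data, train_size)   # resampled training set
        predict = fit(train)
        for i, (x, _) in enumerate(data):
            votes[i][predict(x)] += 1
    bias = variance = 0.0
    for counts, (_, y) in zip(votes, data):
        main_pred, n_main = counts.most_common(1)[0]
        bias += (main_pred != y)               # error of the central tendency
        variance += 1 - n_main / n_trials      # spread around it
    return bias / len(data), variance / len(data)

# Toy data: the feature is irrelevant and 70% of labels are 0.
data = [(None, 0)] * 70 + [(None, 1)] * 30
b_small, v_small = estimate_bias_variance(data, train_size=3)
b_large, v_large = estimate_bias_variance(data, train_size=30)
```

Under this setup, growing the training set shrinks the variance term while leaving the bias term roughly fixed, which is the qualitative pattern the abstract's conclusion rests on.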
Making Early Predictions of the Accuracy of Machine Learning Applications
This paper hypothesises that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behaviour may be predictable.
Making Early Predictions of the Accuracy of Machine Learning Classifiers
This chapter hypothesizes that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behavior may be predictable, and investigates techniques for making such early predictions.
Learning with few examples: An empirical study on leading classifiers
The study presented in this paper examines a larger panel of both algorithms (9 different kinds) and data sets (17 UCI bases) to assess the ability of algorithms to produce models from only a few examples.
Tree Induction Vs Logistic Regression: A Learning Curve Analysis
A large-scale experimental comparison of logistic regression and tree induction is presented, assessing classification accuracy and the quality of rankings based on class-membership probabilities, and a learning-curve analysis is used to examine the relationship of these measures to the size of the training set.
Concept-drifting Data Streams are Time Series; The Case for Continuous Adaptation
  • J. Read
  • Computer Science, Mathematics
  • ArXiv
  • 2018
It is shown that Hoeffding-tree based ensembles are not naturally suited to learning under concept drift and can perform in this scenario only at the significant computational cost of destructive adaptation; gradient-descent methods are developed and parameterized, demonstrating how they can perform continuous adaptation with no explicit drift-detection mechanism.
DO NOT DISTURB? Classifier Behavior on Perturbed Datasets
A surprising conclusion of those experiments is that classification on an anonymized dataset, with outliers removed beforehand, can almost compete with classification on the original, un-anonymized dataset, which could soon lead to competitive machine learning pipelines on anonymized datasets for real-world use in the marketplace.
Scalable Learning of Bayesian Network Classifiers
This paper proposes an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier and requires only one additional pass through the training data, making it a three-pass learner.
Preconditioning an Artificial Neural Network Using Naive Bayes
It is shown that this NB preconditioning can speed up convergence significantly, and that optimizing a linear model with MSE leads to a lower-bias classifier than optimizing with CLL.
Stop Wasting Time: On Predicting the Success or Failure of Learning for Industrial Applications
The successful application of machine learning techniques to industrial problems places various demands on the collaborators. The system designers must possess appropriate analytical skills and [...]
Different Approaches to Reducing Bias in Classification of Medical Data by Ensemble Learning Methods
In this study, different models were created to reduce bias by ensemble learning methods, and methods based on stacking displayed a higher performance compared to other methods.

References

SHOWING 1-10 OF 14 REFERENCES
Bias, Variance, and Arcing Classifiers
This work explores two arcing algorithms, compares them to each other and to bagging, and tries to understand how arcing works; arcing is more successful than bagging in variance reduction.
Experiments with a New Boosting Algorithm
This paper describes experiments carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems, and compares boosting to Breiman's "bagging" method when used to aggregate various classifiers.
C4.5: Programs for Machine Learning
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Boosting the margin: A new explanation for the effectiveness of voting methods
It is shown that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error.
Programs for Machine Learning
In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Peepholing: Choosing Attributes Efficiently for Megainduction
Empirical evaluations suggest this new method of speeding up the induction of decision trees from large noisy domains with continuous attributes is several times faster than the basic ID3 algorithm on training sets of tens of thousands of examples, and for very large sets reduces learning time from a superlinear to an approximately linear function of the number of examples.
An Analysis of Bayesian Classifiers
An average-case analysis of the Bayesian classifier, a simple induction algorithm that fares remarkably well on many learning tasks, and explores the behavioral implications of the analysis by presenting predicted learning curves for artificial domains.
Error-Correcting Output Coding Corrects Bias and Variance
An investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms, shows that it can reduce the variance of the learning algorithm.
Bias Plus Variance Decomposition for Zero-One Loss Functions
It is shown that in practice the naive frequency-based estimation of the decomposition terms is itself biased, and it is shown how to correct for this bias.
Scaling Up Inductive Algorithms: An Overview
Common ground is established for researchers addressing the challenge of scaling up inductive data mining algorithms to very large databases, and for practitioners who want to understand the state of the art.