# On the effect of data set size on bias and variance in classification learning

@inproceedings{Brain1999OnTE,
  title={On the effect of data set size on bias and variance in classification learning},
  author={Damien Brain and Geoffrey I. Webb},
  year={1999}
}

With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. [...] **Key result:** These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.
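The bias/variance decomposition of zero-one loss that the paper studies can be estimated empirically by training a learner on many resampled training sets and measuring, per test point, the error of the majority prediction (a bias-like term) and the disagreement with that majority (a variance-like term). The sketch below is a minimal illustration in that spirit, not the paper's own procedure; the toy 1-D threshold learner and all function names are assumptions for illustration.

```python
import random
from collections import Counter

def fit_threshold(train):
    """Toy learner: pick the threshold t minimising training error
    for the rule 'predict True iff x >= t' on 1-D labelled data."""
    xs = sorted(x for x, _ in train)
    best_t, best_err = xs[0], len(train) + 1
    for t in xs:
        err = sum((x >= t) != y for x, y in train)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bias_variance(sample_train, test, n_runs=50, seed=0):
    """Estimate bias-like and variance-like components of 0/1 loss.
    sample_train(rng) must return a fresh training set each call."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_runs):
        t = fit_threshold(sample_train(rng))
        preds.append([x >= t for x, _ in test])
    bias = variance = 0.0
    for i, (_, y) in enumerate(test):
        votes = Counter(p[i] for p in preds)          # predictions across runs
        main_pred, _ = votes.most_common(1)[0]        # central-tendency prediction
        bias += (main_pred != y) / len(test)          # majority prediction is wrong
        variance += (n_runs - votes[main_pred]) / (n_runs * len(test))  # disagreement
    return bias, variance
```

On noiseless data with true rule `y = (x > 0.5)`, variance estimated this way shrinks as the resampled training sets grow, which is the qualitative behaviour the paper examines at scale.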


#### 68 Citations

Making Early Predictions of the Accuracy of Machine Learning Applications

- Computer Science, Mathematics
- ArXiv
- 2012

This paper hypothesises that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behaviour may be predictable.

Making Early Predictions of the Accuracy of Machine Learning Classifiers

- Computer Science
- 2012

This chapter hypothesizes that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behavior may be predictable, and investigates techniques for making such early predictions.

Learning with few examples: An empirical study on leading classifiers

- Computer Science
- The 2011 International Joint Conference on Neural Networks
- 2011

The study presented in this paper examines a larger panel of both algorithms (9 different kinds) and data sets (17 UCI bases) to assess the ability of algorithms to produce models from only a few examples.

Tree Induction Vs Logistic Regression: A Learning Curve Analysis

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2003

A large-scale experimental comparison of logistic regression and tree induction is presented, assessing classification accuracy and the quality of rankings based on class-membership probabilities, and a learning-curve analysis is used to examine the relationship of these measures to the size of the training set.

Concept-drifting Data Streams are Time Series; The Case for Continuous Adaptation

- Computer Science, Mathematics
- ArXiv
- 2018

It is shown that Hoeffding-tree based ensembles are not naturally suited to learning under concept drift, and can perform in this scenario only at the significant computational cost of destructive adaptation; gradient-descent methods are developed and parameterized, demonstrating how they can perform continuous adaptation with no explicit drift-detection mechanism.

DO NOT DISTURB? Classifier Behavior on Perturbed Datasets

- Computer Science
- CD-MAKE
- 2017

A surprising conclusion of those experiments is that classification on an anonymized dataset, with outliers removed beforehand, can almost compete with classification on the original, un-anonymized dataset, which could soon lead to competitive machine learning pipelines on anonymized datasets for real-world use in the marketplace.

Scalable Learning of Bayesian Network Classifiers

- Computer Science
- J. Mach. Learn. Res.
- 2016

This paper proposes an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier that requires only one additional pass through the training data, making it a three-pass learner.

Preconditioning an Artificial Neural Network Using Naive Bayes

- Computer Science
- PAKDD
- 2016

It is shown that this NB preconditioning can speed-up convergence significantly and that optimizing a linear model with MSE leads to a lower bias classifier than optimizing with CLL.

Stop Wasting Time: On Predicting the Success or Failure of Learning for Industrial Applications

- Computer Science
- IDEAL
- 2007

The successful application of machine learning techniques to industrial problems places various demands on the collaborators. The system designers must possess appropriate analytical skills and…

Different Approaches to Reducing Bias in Classification of Medical Data by Ensemble Learning Methods

- Computer Science
- 2021

In this study, different models were created to reduce bias by ensemble learning methods and methods based on stacking displayed a higher performance compared to other methods.

#### References

Showing 1–10 of 14 references

Bias, Variance, and Arcing Classifiers

- Computer Science
- 1996

This work explores two arcing algorithms, compares them to each other and to bagging, and tries to understand how arcing works, which is more successful than bagging at variance reduction.

Experiments with a New Boosting Algorithm

- Computer Science
- ICML
- 1996

This paper describes experiments carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems, and compares boosting to Breiman's "bagging" method when used to aggregate various classifiers.

C4.5: Programs for Machine Learning

- Computer Science
- 1992

A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.

Boosting the margin: A new explanation for the effectiveness of voting methods

- Mathematics, Computer Science
- ICML
- 1997

It is shown that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error.

Programs for Machine Learning

- Computer Science
- 1994

In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.

Peepholing: Choosing Attributes Efficiently for Megainduction

- Mathematics, Computer Science
- ML
- 1992

Empirical evaluations suggest this new method of speeding up the induction of decision trees from large noisy domains with continuous attributes is several times faster than the basic ID3 algorithm on training sets of tens of thousands of examples, and for very large sets reduces learning time from a superlinear to an approximately linear function of the number of examples.

An Analysis of Bayesian Classifiers

- Computer Science
- AAAI
- 1992

An average-case analysis of the Bayesian classifier, a simple induction algorithm that fares remarkably well on many learning tasks, and explores the behavioral implications of the analysis by presenting predicted learning curves for artificial domains.

Error-Correcting Output Coding Corrects Bias and Variance

- Computer Science
- ICML
- 1995

An investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms, shows that it can reduce the variance of the learning algorithm.

Bias Plus Variance Decomposition for Zero-One Loss Functions

- Computer Science
- ICML
- 1996

It is shown that in practice the naive frequency-based estimation of the decomposition terms is itself biased, and it is shown how to correct for this bias.

Scaling Up Inductive Algorithms: An Overview

- Computer Science
- KDD
- 1997

Common ground is established for researchers addressing the challenge of scaling up inductive data mining algorithms to very large databases, and for practitioners who want to understand the state of the art.