On the Connection between In-sample Testing and Generalization Error

Abstract

This paper proves that it is impossible to justify a correlation between reproduction of a training set and generalization error off of the training set using only a priori reasoning. As a result, the use in the real world of any generalizer that fits a hypothesis function to a training set (e.g., the use of back-propagation) is implicitly predicated on an assumption about the physical universe. This paper shows how this assumption can be expressed in terms of a non-Euclidean inner product between two vectors, one representing the physical universe and one representing the generalizer. In deriving this result, a novel formalism for addressing machine learning is developed. This new formalism can be viewed as an extension of the conventional "Bayesian" formalism, to (among other things) allow one to address the case in which one's assumed "priors" are not exactly correct. The most important feature of this new formalism is that it uses an extremely low-level event space, consisting of triples of {target function, hypothesis function, training set}. Partly as a result of this feature, most other formalisms that have been constructed to address machine learning (e.g., PAC, the Bayesian formalism, and the "statistical mechanics" formalism) are special cases of the formalism presented in this paper. Consequently such formalisms are capable of addressing only a subset of the issues addressed in this paper. In fact, the formalism of this paper can be used to address all generalization issues of which the author is aware: over-training, the need to restrict the number of free parameters in the hypothesis function, the problems associated with a "non-representative" training set, whether and when cross-validation works, whether and when stacked generalization works, whether and when a particular regularizer will work, and so forth. A summary of some of the more important results of this paper concerning these and related topics can be found in the conclusion.

*Current address: The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM, 87501. Electronic mail address: dhw@sfi.santafe.edu
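
The abstract's first claim can be illustrated with a toy sketch. This is not the paper's formalism: it assumes a three-bit Boolean input space, a hypothetical four-point training set, and, as a stand-in for "only a priori reasoning", a uniform average over every target function consistent with that training set. The names `average_ots_error`, `parity`, and `always_zero` are invented for the illustration. Under those assumptions, two hypotheses that both reproduce the training set exactly, yet disagree everywhere else, have the same average off-training-set error.

```python
from fractions import Fraction
from itertools import product


def average_ots_error(inputs, train, hypothesis):
    """Average off-training-set (OTS) error of `hypothesis`, taken uniformly
    over every Boolean target function that agrees with `train`.

    The uniform average is an assumption made for this illustration only.
    """
    # Points not covered by the training set.
    ots = [x for x in inputs if x not in train]
    total = Fraction(0)
    count = 0
    # Each labelling of the OTS points is one target consistent with `train`.
    for labels in product((0, 1), repeat=len(ots)):
        wrong = sum(hypothesis(x) != y for x, y in zip(ots, labels))
        total += Fraction(wrong, len(ots))
        count += 1
    return total / count


def parity(x):
    """Hypothesis 1: parity of the input bits."""
    return sum(x) % 2


def always_zero(x):
    """Hypothesis 2: the constant function 0."""
    return 0


# All three-bit inputs, and a hypothetical training set (even-parity points).
inputs = list(product((0, 1), repeat=3))
train = {(0, 0, 0): 0, (0, 1, 1): 0, (1, 0, 1): 0, (1, 1, 0): 0}

# Both hypotheses reproduce the training set perfectly...
fits = [all(h(x) == y for x, y in train.items()) for h in (parity, always_zero)]

# ...and both have the same average OTS error over consistent targets.
errors = [average_ots_error(inputs, train, h) for h in (parity, always_zero)]

print(fits, errors)
```

Perfect in-sample reproduction does not separate the two hypotheses here; preferring one over the other requires some assumption about which targets are likely, which is the kind of assumption the abstract says any such generalizer implicitly makes.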

Cite this paper

@article{Wolpert1992OnTC,
  title={On the Connection between In-sample Testing and Generalization Error},
  author={David H. Wolpert},
  journal={Complex Systems},
  year={1992},
  volume={6}
}