George H. John

In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the ...
We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small, high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present definitions for ...
As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the data-mining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this ...
Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing ...
We present MLC++, a library of C++ classes and tools for supervised Machine Learning. While MLC++ provides general learning algorithms that can be used by end users, the main objective is to provide researchers and experts with a wide variety of tools that can accelerate algorithm development, increase software reliability, provide comparison tools, and ...
When mining large databases, the data extraction problem and the interface between the database and data mining algorithm become important issues. Rather than giving a mining algorithm full access to a database (by extracting to a flat file or other directly accessible data structure), we propose the SQL Interface Protocol (SIP), which is a framework for ...
We present a new method for the induction of classification trees with linear discriminants as the partitioning function at each internal node. This paper presents two main contributions: first, a novel objective function called soft entropy, which is used to identify optimal coefficients for the linear discriminants, and second, a novel method for removing ...