Data Set Used
To my father, for curiosity. To my mother, for discipline.
This paper has no novel learning or statistics: it is concerned with making a wide class of pre-existing statistics and learning algorithms com-putationally tractable when faced with data sets with massive numbers of records or attributes. It briefly reviews the static AD-tree structure of Moore and Lee (1998), and offers a new structure with more… (More)
Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and need no parameter tuning. We present a simple implementation of logistic regression that meets these requirements. A combination of regularization, truncated Newton methods, and iteratively re-weighted least squares… (More)
Although popular and extremely well established in mainstream statistical data analysis, logistic regression is strangely absent in the field of data mining. There are two possible explanations of this phenomenon. First, there might be an assumption that any tool which can only produce linear classification boundaries is likely to be trumped by more modern… (More)
Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider the problem of link completion, identifying which entities… (More)
Previous work by the authors  demonstrated that logistic regression can be a fast and accurate data mining tool for life sciences datasets, competitive with modern tools like support vector machines and ball-tree based K-NN. This paper has two objectives. The first objective is a serious empirical comparison of logistic regression to several classical… (More)
There many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g.… (More)