Stephen D. Bay

Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in ...
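As a rough illustration of distance-based outlier detection with a nested loop, the sketch below scores each example by the distance to its k-th nearest neighbor and returns the highest-scoring examples. The function name `knn_distance_outliers`, the choice of Euclidean distance, the k-th-neighbor score, and the toy data are all assumptions made for this example; the abstract does not specify them, and this plain version is always quadratic, with none of the speedups the abstract alludes to.

```python
import heapq
import math


def knn_distance_outliers(points, k, n_outliers):
    """Return the n_outliers (score, index) pairs with the largest scores.

    The score of a point is the Euclidean distance to its k-th nearest
    neighbor (an assumed definition, chosen for illustration).
    Requires k <= len(points) - 1.
    """
    scores = []
    for i, p in enumerate(points):  # outer loop: candidate outlier
        # Inner loop: distance from p to every other point.
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        # Distance to the k-th nearest neighbor of p.
        kth = heapq.nsmallest(k, dists)[-1]
        scores.append((kth, i))
    return heapq.nlargest(n_outliers, scores)


# Hypothetical toy data: four clustered points and one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
top = knn_distance_outliers(data, k=2, n_outliers=1)
```

Here `top` holds a single `(score, index)` pair whose index is 4, the distant point.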
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that ...
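To make the idea of a conjunction of attributes and values concrete, the sketch below computes, for each group, the fraction of records that match such a conjunction. The abstract is cut off before it states what property a contrast set must have, so comparing per-group support is an assumption made here, as are the helper names `support` and `support_by_group` and the made-up student records.

```python
def support(rows, itemset):
    """Fraction of rows that match every attribute=value pair in itemset."""
    if not rows:
        return 0.0
    hits = sum(all(r.get(a) == v for a, v in itemset.items()) for r in rows)
    return hits / len(rows)


def support_by_group(groups, itemset):
    """Support of the same conjunction, computed separately for each group."""
    return {name: support(rows, itemset) for name, rows in groups.items()}


# Hypothetical records for two groups (illustrative only, not real data).
groups = {
    "1993": [
        {"major": "cs", "housing": "dorm"},
        {"major": "bio", "housing": "dorm"},
    ],
    "1998": [
        {"major": "cs", "housing": "dorm"},
        {"major": "cs", "housing": "dorm"},
    ],
}
result = support_by_group(groups, {"major": "cs", "housing": "dorm"})
```

With this toy input, `result` maps "1993" to 0.5 and "1998" to 1.0, so the conjunction is more common in the second group.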
Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve the nearest neighbor ...
Many algorithms in data mining can be formulated as a set-mining problem where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set-mining techniques have been largely designed for categorical or discrete data where variables can only take on a fixed number of values. However, many datasets also contain ...
Discovering the complex regulatory networks that govern mRNA expression is an important but difficult problem. Many current approaches use only expression data from microarrays to infer the likely network structure. However, this ignores much existing knowledge because for a given organism and system under study, a biologist may already have a partial model ...