We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. Importantly, features for semi-CRFs can measure…
In most learning systems, examples are represented as fixed-length "feature vectors", the components of which are either real numbers or nominal values. We propose an extension of the feature-vector representation that allows the value of a feature to be a set of strings; for instance, to represent a small white and black dog with the nominal features size and…
Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, …
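To make two of the metric families named above concrete, here is a minimal Python sketch of one edit-distance metric (Levenshtein) and one token-based measure (Jaccard similarity over word tokens). This is an illustration only: it is not the Java toolkit the abstract refers to, the function names `levenshtein` and `token_jaccard` are hypothetical, and the tokenization rule (lowercased runs of word characters) is an assumption.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two names' token sets.
    Tokenization (lowercased word-character runs) is an assumed choice."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


edit_example = levenshtein("kitten", "sitting")
reordered_edit = levenshtein("William W. Cohen", "Cohen, William")
reordered_jaccard = token_jaccard("William W. Cohen", "Cohen, William")
```

The reordered-name pair shows why the families behave differently: character-level edit distance penalizes reordering heavily, while a token-set measure ignores word order altogether.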
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which…
Two recently implemented machine-learning algorithms, RIPPER and sleeping-experts for phrases, are evaluated on a number of large text categorization problems. These algorithms both construct classifiers that allow the "context" of a word w to affect how (or even whether) the presence or absence…
We present a simple and scalable graph clustering method called power iteration clustering (PIC). PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix of the data. This embedding turns out to be an effective cluster indicator, consistently outperforming widely used spectral…
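The idea of truncated power iteration on a normalized similarity matrix can be sketched in a few lines of pure Python. This is a toy under stated assumptions, not the authors' implementation: the Gaussian similarity, the fixed iteration count, the deliberately non-uniform start vector, and the final largest-gap split into two groups are all choices made here for illustration, and the names `pic_embedding` and `split_at_largest_gap` are hypothetical.

```python
import math


def pic_embedding(points, sigma=1.0, iterations=15):
    """One-dimensional embedding from a few power-iteration steps."""
    n = len(points)
    # Pairwise Gaussian similarity with a zero diagonal (assumed choice).
    A = [[0.0 if i == j
          else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # Row-normalize so that every row of W sums to 1.
    W = [[a / sum(row) for a in row] for row in A]
    # Assumed start vector: non-uniform, scaled to unit L1 norm.
    v = [float(i + 1) for i in range(n)]
    total = sum(v)
    v = [x / total for x in v]
    # Truncated power iteration: stop after a small fixed number of steps,
    # long before v converges to the uninformative constant vector.
    for _ in range(iterations):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def split_at_largest_gap(values):
    """Toy two-way clustering: cut the sorted values at the widest gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]]
            for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for k in range(cut + 1, len(order)):
        labels[order[k]] = 1
    return labels


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # two well-separated groups
embedding = pic_embedding(points)
labels = split_at_largest_gap(embedding)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Within each tight group the embedding values become nearly identical after a few iterations, while the two groups remain far apart, which is what makes the one-dimensional vector usable as a cluster indicator. For more than two clusters, a general-purpose clusterer such as k-means on the embedding would replace the gap split.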
Many existing rule learning systems are computationally expensive on large noisy datasets. In this paper we evaluate the recently proposed rule learning algorithm IREP on a large and diverse collection of benchmark problems. We show that while IREP is extremely efficient, it frequently gives error rates higher than those of C4.5 and C4.5rules. We then propose…