We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. Importantly, features for semi-CRFs can measure…
In most learning systems, examples are represented as fixed-length "feature vectors", the components of which are either real numbers or nominal values. We propose an extension of the feature-vector representation that allows the value of a feature to be a set of strings; for instance, to represent a small white and black dog with the nominal features size and…
Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall,…
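To make two of the metric families named above concrete, here is a minimal Python sketch of one edit-distance metric (Levenshtein distance) and one token-based metric (Jaccard similarity over whitespace tokens). The function names and the choice of these two particular metrics are illustrative assumptions; this is not the toolkit's API, and the truncated abstract does not say which specific metrics were compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_tokens(a: str, b: str) -> float:
    """Token-based similarity: |A ∩ B| / |A ∪ B| over lowercased
    whitespace-separated tokens (an assumed tokenization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(ta | tb)
```

The edit-distance function is sensitive to character-level typos, while the token-based function ignores word order, so "Cohen William" and "William Cohen" score as identical under the second but not the first.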
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which…
We present a simple and scalable graph clustering method called power iteration clustering (PIC). PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix of the data. This embedding turns out to be an effective cluster indicator, consistently outperforming widely used spectral…
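The idea of truncated power iteration on a normalized similarity matrix can be sketched in a few lines of plain Python. The sketch below makes several assumptions that the abstract does not state: row normalization of the similarity matrix, a starting vector proportional to the row sums, a fixed iteration count as the truncation rule, and a simple midpoint threshold in place of a proper clustering step on the embedding. All function names and the toy matrix are hypothetical.

```python
def power_iteration_embedding(similarity, num_iters=10):
    """Return a one-dimensional embedding from a few power-iteration steps."""
    n = len(similarity)
    # Assumed normalization: divide each row by its sum (W = D^-1 A).
    row_sums = [sum(row) for row in similarity]
    w = [[a / row_sums[i] for a in row] for i, row in enumerate(similarity)]
    # Assumed starting vector: row sums scaled to sum to 1.
    total = sum(row_sums)
    v = [s / total for s in row_sums]
    # Stop early (truncate) instead of iterating to convergence.
    for _ in range(num_iters):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def split_by_midpoint(embedding):
    """Two-way split at the midpoint of the embedding's range
    (a stand-in for running a clustering algorithm on the embedding)."""
    mid = (min(embedding) + max(embedding)) / 2.0
    return [0 if x < mid else 1 for x in embedding]


# Toy data: points 0-2 are mutually similar, points 3-4 are mutually
# similar, and cross-group similarity is weak. Self-similarity is 0.
similarity = [
    [0.0, 1.0, 1.0, 0.01, 0.01],
    [1.0, 0.0, 1.0, 0.01, 0.01],
    [1.0, 1.0, 0.0, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.0, 1.0],
    [0.01, 0.01, 0.01, 1.0, 0.0],
]
embedding = power_iteration_embedding(similarity)
labels = split_by_midpoint(embedding)
```

Truncation matters here: run to convergence, the iteration would drive every entry toward the same value, so the cluster structure is visible only in the intermediate vectors.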
Two recently implemented machine-learning algorithms, RIPPER and sleeping-experts for phrases, are evaluated on a number of large text categorization problems. These algorithms both construct classifiers that allow the "context" of a word w to affect how (or even whether) the presence or absence…
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based…