We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal…
Genre classification means discriminating between documents by their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents' contents. While most existing investigations of automated genre classification are based on corpora of news articles, the idea…
The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism…
We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but are tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate as they…
We present an evaluation framework for plagiarism detection. The framework provides performance measures that address the specifics of plagiarism detection, and the PAN-PC-10 corpus, which contains 64,558 artificial and 4,000 simulated plagiarism cases, the latter generated via Amazon's Mechanical Turk. We discuss the construction principles behind the…
In the field of information retrieval, clustering algorithms are used to analyze large collections of documents with the objective of forming groups of similar documents. Clustering a document collection is an ambiguous task: a clustering, i.e., a set of document groups, depends on the chosen clustering algorithm as well as on the algorithm's parameter…
This paper overviews 15 plagiarism detectors that have been evaluated within the fourth international competition on plagiarism detection at PAN'12. We report on their performances for two sub-tasks of external plagiarism detection: candidate document retrieval and detailed document comparison. Furthermore, we introduce the PAN plagiarism corpus 2012, the…