A repetition based measure for verification of text collections and for text categorization

@article{Khmelev2003ARB,
  title={A repetition based measure for verification of text collections and for text categorization},
  author={D. Khmelev and W. Teahan},
  journal={Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval},
  year={2003}
}
  • D. Khmelev, W. Teahan
  • Published 2003
  • Computer Science
  • Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they…
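The R-measure described above can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract, not the authors' implementation: for each suffix of the text, it takes the length of the longest prefix of that suffix that occurs in some other document, sums those lengths, and divides by l(l+1)/2 (an assumed normalization, chosen so that a text fully contained in another document scores 1). The names `r_measure` and `common_prefix_len` are hypothetical, and the "suffix array" here is a naive sorted list of suffix strings, suitable only for small inputs.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def r_measure(text, others):
    """Sketch of a repetition-based measure of `text` against `others`.

    Assumption: normalization by l*(l+1)/2, where l = len(text).
    """
    if not text:
        return 0.0
    # Naive stand-in for a suffix array: all suffixes of every other
    # document, sorted lexicographically. Suffixes are taken per document,
    # so matches never span a document boundary.
    suffixes = sorted(doc[i:] for doc in others for i in range(len(doc)))
    total = 0
    for i in range(len(text)):
        suffix = text[i:]
        pos = bisect_left(suffixes, suffix)
        # The longest match with any sorted suffix is attained at one of
        # the two neighbours of the insertion point.
        best = 0
        for j in (pos - 1, pos):
            if 0 <= j < len(suffixes):
                best = max(best, common_prefix_len(suffix, suffixes[j]))
        total += best
    l = len(text)
    return 2.0 * total / (l * (l + 1))
```

Under these assumptions, a value near 1 suggests the text is almost entirely repeated elsewhere in the collection, while a value near 0 suggests little overlap. A practical version would replace the sorted list of suffix strings with a real suffix array over the concatenated collection.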
Text Augmentation: Inserting markup into natural language text with PPM Models
TLDR
A new optimisation and new heuristics for automatically marking up XML documents are implemented in CEM, using PPM models, significantly more general than previous systems.
Legal documents categorization by compression
TLDR
Far from having found a silver bullet, it is shown that compression-based techniques provide the best results for the problem at hand, and argued that these approaches can be effectively coupled with more informative and semantically grounded ones.
Verifying a Chinese collection for text categorization
TLDR
A novel retrieval-based approach was developed to detect duplicates and label inconsistency in this corpus and, for comparison, in Reuters-21578; experiments showed that effectiveness was not affected by the confusing documents.
Plagiarism detection using stopword n-grams
TLDR
It is shown that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries.
On redundancy of training corpus for text categorization: a perspective of geometry
TLDR
This paper studies the redundancy of training corpora in the context of kNN text categorization, aiming to explore how to judge whether a training corpus is redundant and how to reduce that redundancy, and develops a redundancy reduction algorithm.
RCV1: A New Benchmark Collection for Text Categorization Research
TLDR
This work describes the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data.
On Compression-Based Text Classification
TLDR
This work presents the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text, and specifically designed to test whether the ability to capture non-word features causes character-based text compression methods to achieve more accurate classification.
N-Gram Feature Selection for Authorship Identification
TLDR
This paper proposes a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences and explores the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved using simple text pre-processing.
Author Identification Using Imbalanced and Limited Training Texts
  • E. Stamatatos
  • Computer Science
  • 18th International Workshop on Database and Expert Systems Applications (DEXA 2007)
  • 2007
TLDR
The results show that CNG with the proposed distance measures is more accurate when only limited training text samples are available, at least for some of the candidate authors, a realistic condition in author identification problems.

References

SHOWING 1-10 OF 38 REFERENCES
Text categorization using compression models
TLDR
Text categorization is the assignment of natural language texts to predefined categories based on their concept to provide an overall judgement on the document as a whole, rather than discarding information by pre-selecting features.
Text classification and segmentation using minimum cross-entropy
TLDR
Experimental results show that the methods are a significant improvement over previously used methods in a number of areas, for example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre.
An algorithm for suffix stripping
TLDR
An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL and performs slightly better than a much more elaborate system with which it has been compared.
Authorship Attribution with Support Vector Machines
TLDR
The support vector machine (SVM) is applied to the use of text-mining methods for the identification of the author of a text; as it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text.
A Theoretical and Experimental Study on the Construction of Suffix Arrays in External Memory
TLDR
A new external-memory algorithm is devised that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance.
Using compression based language models for text categorization.
TLDR
Two approaches to compression-based categorization are presented, one based on ranking by document cross entropy (average bits per coded symbol) with respect to a category model, and the other based on document cross entropy difference between category and complement of category models.
Using Literal and Grammatical Statistics for Authorship Attribution
TLDR
It turns out that the frequencies of occurrences of letter pairs and pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, they could be used in disputed authorship attribution.
Duplicate Detection in the Reuters Collection
TLDR
The results of this study revealed that the notion of a duplicate document was not as simple as first thought, and a review of previous duplicate detection research is presented.
Suffix arrays: a new method for on-line string searches
TLDR
A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
The Reuters Corpus Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources
TLDR
The origins of RCV1, the motivations behind its creation, and how it differs from previous corpora are described, and the system of category coding, whereby each story is annotated for topic, region and industry sector, is discussed.