Learn More
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of(More)
There is a common Mathematics Subject Classification (MSC) System used for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on classification task of top-level MSC categories exceeds 89%. We describe and(More)
— Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning non-trivial internal parameters. Although(More)
With the explosion of the size of digital dataset, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we're only interested in algorithms that operate in constant memory w.r.t. to the input size, so that arbitrarily large input can be processed. In(More)
  • 1