Learn More
There is a common Mathematics Subject Classification (MSC) System used for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on classification task of top-level MSC categories exceeds 89%. We describe and(More)
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of(More)
— Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning non-trivial internal parameters. Although(More)
With the explosion of the size of digital dataset, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we're only interested in algorithms that operate in constant memory w.r.t. to the input size, so that arbitrarily large input can be processed. In(More)
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability,(More)
  • 1