Radim Rehurek

Modern applications of Latent Semantic Analysis (LSA) must deal with enormous (often practically infinite) data collections, calling for a single-pass matrix decomposition algorithm that operates in constant memory w.r.t. the collection size. This paper introduces a streamed distributed algorithm for incremental SVD updates. Apart from the theoretical …
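The paper's algorithm is distributed and streamed; a minimal single-node sketch of the core idea, merging an existing truncated SVD with each new batch of document vectors so that memory stays constant in the number of documents seen, might look like this (assuming numpy; the actual algorithm's merge step differs in detail):

```python
import numpy as np

def update_svd(U, S, batch, rank):
    """Fold a new batch of column vectors into an existing truncated SVD.

    Decomposes the stacked matrix [U @ diag(S) | batch] and re-truncates
    to `rank`, so memory use is independent of how many documents have
    been processed so far.
    """
    stacked = np.hstack([U * S, batch])            # features x (rank + batch_size)
    U_new, S_new, _ = np.linalg.svd(stacked, full_matrices=False)
    return U_new[:, :rank], S_new[:rank]

# one pass over a stream of document batches
rng = np.random.default_rng(0)
rank, n_features = 5, 100
U = np.zeros((n_features, rank))
S = np.zeros(rank)
for _ in range(10):                                # 10 batches of 20 documents
    batch = rng.standard_normal((n_features, 20))
    U, S = update_svd(U, S, batch, rank)

print(U.shape, S.shape)                            # → (100, 5) (5,)
```

Only the left factors are kept, which matches the usual LSA setting where `U` spans the latent topic space; the right singular vectors for old documents are discarded rather than updated.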
The Mathematics Subject Classification (MSC) is a common system for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC from the full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on the classification task for top-level MSC categories exceeds 89%. We describe and …
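The shape of this task can be sketched with a toy text classifier; the features and learner below (bag-of-words plus naive Bayes, via scikit-learn) are placeholders, not necessarily what the paper used, and the documents and codes are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy "full texts" and their top-level MSC codes (15 = linear algebra, 35 = PDEs)
docs = [
    "eigenvalues of symmetric matrices and spectral decomposition",
    "spectral theory of linear operators on Hilbert spaces",
    "solutions of nonlinear partial differential equations",
    "boundary value problems for elliptic equations",
]
labels = ["15", "15", "35", "35"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["spectral decomposition of operators"])[0])   # → 15
```

The real task is much harder: full paper texts, dozens of top-level categories, and heavily imbalanced classes.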
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these …
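The character n-gram profile method mentioned above (in the style of Cavnar and Trenkle) can be sketched in a few lines: each language is represented by its most frequent character trigrams, and a sample is assigned to the language whose profile it displaces least. The training snippets here are tiny stand-ins; real profiles are built from large corpora:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked list of the most frequent character n-grams in `text`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, sample_profile):
    """Distance: sum of rank displacements of sample n-grams in the profile.
    N-grams missing from the profile pay a maximal penalty."""
    rank = {g: i for i, g in enumerate(profile)}
    penalty = len(profile)
    return sum(rank.get(g, penalty) for g in sample_profile)

training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}

sample = ngram_profile("the dog and the cat")
best = min(profiles, key=lambda lang: out_of_place(profiles[lang], sample))
print(best)   # → en
```

The known weakness, which motivates the paper's investigation, is that such profiles degrade on very short samples and on closely related languages, where the top trigram ranks barely differ.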
With the explosion in the size of digital datasets, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we are only interested in algorithms that operate in constant memory w.r.t. the input size, so that arbitrarily large inputs can be processed. In …
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to 'vector similarity searching' over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based full-text engines, taking advantage of their robustness, stability, …
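One way to make a term-based inverted index serve dense-vector queries (a hypothetical encoding for illustration, not necessarily the paper's scheme) is to quantize each vector dimension into a discrete token, so that standard term matching approximates vector similarity:

```python
from collections import Counter
import numpy as np

def to_tokens(vec, n_buckets=4):
    # quantize each dimension of `vec` into one of `n_buckets` intervals
    # over [-1, 1] and emit one string token per (dimension, bucket) pair
    edges = np.linspace(-1, 1, n_buckets + 1)[1:-1]
    return [f"d{i}b{b}" for i, b in enumerate(np.digitize(vec, edges))]

rng = np.random.default_rng(1)
docs = rng.uniform(-1, 1, size=(5, 8))      # 5 documents, 8-dim dense vectors

# toy inverted index: token -> set of document ids
index = {}
for doc_id, vec in enumerate(docs):
    for tok in to_tokens(vec):
        index.setdefault(tok, set()).add(doc_id)

# retrieval: score candidate documents by the number of shared tokens
query = docs[2].copy()                      # query identical to document 2
scores = Counter()
for tok in to_tokens(query):
    for doc_id in index.get(tok, ()):
        scores[doc_id] += 1

print(scores[2])                            # document 2 shares all 8 tokens → 8
```

The payoff of such an encoding is that all retrieval machinery (posting lists, caching, sharding, fault tolerance) comes for free from the mature full-text engine underneath.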