- Radim Rehurek, Petr Sojka
- 2010

Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which…

- Radim Rehurek
- ECIR
- 2011

Modern applications of Latent Semantic Analysis (LSA) must deal with enormous (often practically infinite) data collections, calling for a single-pass matrix decomposition algorithm that operates in constant memory w.r.t. the collection size. This paper introduces a streamed distributed algorithm for incremental SVD updates. Apart from the theoretical…

Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although…

- Radim Rehurek, Petr Sojka
- AISC/MKM/Calculemus
- 2008

There is a common Mathematics Subject Classification (MSC) System used for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on classification task of top-level MSC categories exceeds 89%. We describe and…

- Radim Rehurek, Milan Kolkus
- CICLing
- 2009

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these…

- Radim Rehurek
- ArXiv
- 2011

With the explosion of the size of digital datasets, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we’re only interested in algorithms that operate in constant memory w.r.t. the input size, so that arbitrarily large input can be processed. In…

- Jan Rygl, Jan Pomikálek, Radim Rehurek, Michal Ruzicka, Vít Novotný, Petr Sojka
- ArXiv
- 2017

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability,…

- Radim Rehurek
- 2007

In this paper we propose features desirable of linear text segmentation algorithms for the Information Retrieval domain, with emphasis on improving high similarity search of heterogeneous texts. We proceed to describe a robust purely statistical method, based on context overlap exploitation, that exhibits these desired features. Ways to automatically…

- Radim Rehurek
- EPIA Workshops
- 2007

- Radim Rehurek
- ICAART
- 2011