Software Framework for Topic Modelling with Large Corpora
This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e. processing a corpus one document at a time in a memory-independent fashion. It implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
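The streaming idea can be illustrated with a minimal plain-Python sketch (illustrative only; the class and variable names here are not gensim's actual API): a corpus is any iterable that yields one sparse bag-of-words vector at a time, so memory use stays constant no matter how large the collection is.

```python
# Sketch of memory-independent document streaming: the corpus is an
# iterable that converts and yields one document at a time, so only a
# single document is ever held in memory.
class StreamedCorpus:
    def __init__(self, documents, vocab):
        self.documents = documents  # any iterable of raw text documents
        self.vocab = vocab          # token -> integer id

    def __iter__(self):
        for doc in self.documents:
            counts = {}
            for token in doc.lower().split():
                if token in self.vocab:
                    idx = self.vocab[token]
                    counts[idx] = counts.get(idx, 0) + 1
            # sparse (term_id, count) pairs, one document at a time
            yield sorted(counts.items())

vocab = {"topic": 0, "model": 1, "corpus": 2}
docs = ["topic model", "corpus corpus topic"]
bows = [bow for bow in StreamedCorpus(docs, vocab)]
# bows == [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
```

Because the corpus is only ever traversed lazily, `documents` could just as well read lines from a file or rows from a database; downstream algorithms that accept any iterable of sparse vectors never need the whole corpus in RAM.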
Gensim -- Statistical Semantics in Python
Gensim was created for large digital libraries, but its underlying algorithms for large-scale, distributed, online SVD and LDA are a Swiss Army knife of data analysis, useful on their own outside the domain of Natural Language Processing.
Subspace Tracking for Latent Semantic Analysis
- Radim Rehurek
- Computer Science · ECIR
- 18 April 2011
This paper introduces a streamed distributed algorithm for incremental SVD updates, and presents experiments measuring numerical accuracy and runtime performance of the algorithm over several data collections, one of which is the whole of the English Wikipedia.
Automated Classification and Categorization of Mathematical Knowledge
Results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM show that the F1-measure achieved on the classification task of top-level MSC categories exceeds 89%.
Language Identification on the Web: Extending the Dictionary Method
- Radim Rehurek, M. Kolkus
- Computer Science, Linguistics · Conference on Intelligent Text Processing and…
- 17 February 2009
A new method is proposed and evaluated, which constructs language models based on word relevance and addresses the limitations of existing approaches when applied to real-world web pages.
The Influence of Preprocessing Parameters on Text Categorization
Results of a large-scale study on the mutual influence of preprocessing parameters in automated text categorization are presented and analyzed. These parameters include choice of tokenizer, feature…
Scalability of Semantic Analysis in Natural Language Processing
- Radim Rehurek
The thesis deals with mining data from large corpora. It focuses on robust statistical methods that can automatically create a compact semantic representation of free text, i.e. without the use of…
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
- J. Rygl, Jan Pomikálek, Radim Rehurek, M. Růžička, V. Novotný, Petr Sojka
- Computer Science · Rep4NLP@ACL
- 8 March 2017
This work proposes a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity.
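One way such an encoding could work can be sketched in a few lines (a toy illustration under assumed design choices, not the paper's actual scheme: the `d{i}_b{bucket}` token format, bin count, and value range are all hypothetical). Each dimension of a dense vector is quantized into a discrete string token, which an inverted-index engine can then store and match like an ordinary word.

```python
# Toy sketch: quantize each dimension of a dense vector into a string
# token so that a fulltext inverted index can treat vectors as documents.
# The token scheme (dimension index + value bucket) is illustrative only.
def vector_to_tokens(vec, bins=4, lo=-1.0, hi=1.0):
    """Map component i with value x to a token like 'd3_b2'."""
    width = (hi - lo) / bins
    tokens = []
    for i, x in enumerate(vec):
        bucket = min(bins - 1, max(0, int((x - lo) / width)))
        tokens.append(f"d{i}_b{bucket}")
    return tokens

def build_inverted_index(vectors):
    """token -> set of ids of documents containing that token."""
    index = {}
    for doc_id, vec in enumerate(vectors):
        for tok in vector_to_tokens(vec):
            index.setdefault(tok, set()).add(doc_id)
    return index

def search(index, query_vec):
    """Rank documents by the number of quantized tokens shared with the query."""
    scores = {}
    for tok in vector_to_tokens(query_vec):
        for doc_id in index.get(tok, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

vectors = [[0.9, -0.8], [0.85, -0.75], [-0.9, 0.8]]
index = build_inverted_index(vectors)
results = search(index, [0.88, -0.7])  # nearby vectors share tokens
```

In this toy setup the query `[0.88, -0.7]` shares both tokens with the two similar vectors and none with the dissimilar one, so only the similar documents are returned. A production engine would add the robustness features the abstract mentions (term weighting, scalable postings lists), which the inverted index provides for free.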
Classification of Multilingual Mathematical Papers in DML-CZ
The paper discusses possibilities and experiments on the data of the Czech Digital Mathematics Library (DML-CZ), with the goal of developing novel scalable methods for the classification and retrieval of multilingual mathematical papers.
Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms
- Radim Rehurek
- Computer Science · ArXiv
- 28 February 2011
This paper presents a practical comparison of two such algorithms: a distributed method that operates in a single pass over the input vs. a streamed two-pass stochastic algorithm.