Universal indexes for highly repetitive document collections
@article{Claude2016UniversalIF, title={Universal indexes for highly repetitive document collections}, author={Francisco Claude and Antonio Fari{\~n}a and Miguel A. Mart{\'i}nez-Prieto and Gonzalo Navarro}, journal={ArXiv}, year={2016}, volume={abs/1604.08897} }
33 Citations
Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures
- Computer Science
- 2020
This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.
Indexing Highly Repetitive String Collections, Part II: Compressed Indexes
- Computer Science
- 2020
This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Indexing Highly Repetitive Collections via Grammar Compression
- Computer Science
- 2019
This proposal focuses on the main drawbacks of grammar-based compressors and self-indexes in repetitive collections.
Indexing Highly Repetitive String Collections, Part I
- Computer Science, ACM Comput. Surv.
- 2022
This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.
Indexing Highly Repetitive String Collections
- Computer Science, ArXiv
- 2020
This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects.
Indexing Highly Repetitive String Collections, Part II
- Computer Science, ACM Comput. Surv.
- 2022
This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Hybrid compression of inverted lists for reordered document collections
- Computer Science, Inf. Process. Manag.
- 2018
Universal Compressed Text Indexing
- Computer Science
- 2018
This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.
Compressed Computation for Text Indexing
- Computer Science
- 2017
This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).
References
SHOWING 1-10 OF 77 REFERENCES
Improved index compression techniques for versioned document collections
- Computer Science, CIKM
- 2010
This paper proposes new index compression techniques for versioned document collections that achieve reductions in index size over previous methods, and first proposes several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications.
Optimizing positional index structures for versioned document collections
- Computer Science, SIGIR '12
- 2012
A framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing is proposed and the problem of minimizing positional index size through optimal substring partitioning is studied.
Storage and Retrieval of Highly Repetitive Sequence Collections
- Biology, J. Comput. Biol.
- 2010
New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.
Document Listing on Repetitive Collections
- Computer Science, CPM
- 2013
This paper shows how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing, and develops a new document listing technique for general collections that is of independent interest.
Compact full-text indexing of versioned document collections
- Computer Science, CIKM
- 2009
This paper proposes new techniques for organizing and compressing inverted index structures for versioned document collections, that is, collections that contain multiple versions of each document.
Compressed q-Gram Indexing for Highly Repetitive Biological Sequences
- Computer Science, 2010 IEEE International Conference on BioInformatics and BioEngineering
- 2010
This paper studies alternatives to implement a particularly popular index, namely, the one able to find all the positions in the collection of substrings of fixed length ($q$-grams), and introduces two novel techniques that constitute practical alternatives to handle this scenario.
Composite Repetition-Aware Data Structures
- Computer Science, CPM
- 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
Practical Rank/Select Queries over Arbitrary Sequences
- Computer Science, SPIRE
- 2008
A new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao is presented, that is competitive with the existing ones when the sequences are not too compressible and has nice local compression properties.
Efficient search in large textual collections with redundancy
- Computer Science, WWW '07
- 2007
This paper proposes a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy, and results in significant reductions in index size and query processing costs on such collections.