147 Citations
Compressed Computation for Text Indexing
- Computer Science
- 2017
This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index that offers both fast queries and strong compression rates (up to exponential compression can be achieved).
Universal Compressed Text Indexing
- Computer Science
- 2018
This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.
Block Tree based Universal Self-Index for Repetitive Text Collections
- Computer Science
- 2020
Being able to manipulate text within compressed space, with compression related to its repetitiveness, is of critical importance in many areas of study such as Bioinformatics, Information Retrieval, and Data Mining, among others.
Indexing Highly Repetitive Collections via Grammar Compression
- Computer Science
- 2019
This proposal focuses on the main drawbacks of grammar-based compressors and self-indexes in repetitive collections.
Flexible Indexing of Repetitive Collections
- Computer Science · CiE
- 2017
Practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text are described, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors.
A compressed dynamic self-index for highly repetitive text collections
- Computer Science · Inf. Comput.
- 2020
CHICO: A Compressed Hybrid Index for Repetitive Collections
- Computer Science · SEA
- 2016
This paper presents an implementation of a hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. It is able to successfully index thousands of genomes on a commodity desktop, and it scales up to multi-terabyte collections, provided there is enough secondary memory.
Indexing Highly Repetitive Collections
- Biology, Computer Science · IWOCA
- 2012
Progress made along three research lines to address the need to index and search huge, highly repetitive sequence collections is described: compressed suffix arrays, grammar-compressed indexes, and Lempel-Ziv compressed indexes.
References
Self-indexing Based on LZ77
- Computer Science · CPM
- 2011
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes…
LZ77-Like Compression with Fast Random Access
- Computer Science · 2010 Data Compression Conference
- 2010
This work introduces an alternative Lempel-Ziv text parsing, LZ-End, that converges to the entropy and in practice gets very close to LZ77, which is ideal as a compression format for highly repetitive sequence databases, where access to individual sequences is required.
Indexes for highly repetitive document collections
- Computer Science · CIKM '11
- 2011
We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists,…
Self-Index Based on LZ77
- Computer Science · ArXiv
- 2011
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes…
Compressed q-Gram Indexing for Highly Repetitive Biological Sequences
- Computer Science · 2010 IEEE International Conference on BioInformatics and BioEngineering
- 2010
This paper studies alternatives to implement a particularly popular index, namely, the one capable of finding all the positions in the collection of substrings of fixed length ($q$-grams), and introduces two novel techniques that constitute practical alternatives to handle this scenario.
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
- Computer Science · SPIRE
- 2008
It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. New structures that use run-length encoding are engineered, and empirical evidence that these structures are superior to the current ones is given.
Stronger Lempel-Ziv Based Compressed Text Indexing
- Computer Science · Algorithmica
- 2010
Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring $(2+\epsilon)uH_k(T)+o(u\log\sigma)$ bits of space, for any constant $\epsilon>0$, which makes them the smallest existing LZ-indices.
Storage and Retrieval of Highly Repetitive Sequence Collections
- Biology · J. Comput. Biol.
- 2010
New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and that require space basically proportional to the length of one typical sequence plus the total number of edit operations.
Repetition-Based Text Indexes
- Computer Science
- 1999
A new repetition-based q-gram index, the Lempel-Ziv index for q-grams, is presented; it has asymptotically optimal space requirement and query time provided that q is a constant or grows slowly enough with respect to the length of the text.