Text Indexing and Searching in Sublinear Time

@inproceedings{Munro2020TextIA,
  title={Text Indexing and Searching in Sublinear Time},
  author={J. Ian Munro and Gonzalo Navarro and Yakov Nekrich},
  booktitle={CPM},
  year={2020}
}
We introduce the first index that can be built in $o(n)$ time for a text of length $n$, and also queried in $o(m)$ time for a pattern of length $m$. On a constant-size alphabet, for example, our index uses $O(n\log^{1/2+\varepsilon}n)$ bits, is built in $O(n/\log^{1/2-\varepsilon} n)$ deterministic time, and finds the $\mathrm{occ}$ pattern occurrences in time $O(m/\log n + \sqrt{\log n}\log\log n + \mathrm{occ})$, where $\varepsilon>0$ is an arbitrarily small constant. As a comparison, the… 
Fast Preprocessing for Optimal Orthogonal Range Reporting and Range Successor with Applications to Text Indexing
TLDR
This work is the first to achieve the same preprocessing time for optimal orthogonal range reporting and range successor, and it also applies the results to improve the construction time of text indexes.
Faster Algorithms for Longest Common Substring
TLDR
An $O(n\log^{k-1/2} n)$-time algorithm is shown, which stems from a recursive heavy-path decomposition technique first introduced in the seminal paper of Cole et al.
Internal Shortest Absent Word Queries
TLDR
An $O((n/k)\cdot\log\log_\sigma n)$-size data structure is presented, which can be constructed in $O(n\log_\sigma n)$ time and answers queries in $O(\log\log_\sigma k)$ time.
Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays
TLDR
A long-standing barrier is broken with a new data structure that takes $O(n\log\sigma)$ bits, answers suffix array queries in $O(\log^{\varepsilon} n)$ time, and can be constructed in $O(n\log\sigma/\sqrt{\log n})$ time using $O(n\log\sigma)$ bits of space.
Efficient data structures for internal queries in texts
TLDR
This dissertation provides the first optimal data structure for small alphabets of size $\sigma \ll n$, which handles queries in $O(1)$ time, takes $O(n/\log_\sigma n)$ space, and admits an $O(n/\log_\sigma n)$-time construction from the packed representation of $T$ with $\Theta(\log_\sigma n)$ characters in each machine word.
Dynamic suffix array with polylogarithmic queries and updates
TLDR
This work proposes the first data structure that supports both suffix array queries and text updates in $O(\mathrm{polylog}\, n)$ time (achieving $O(\log^4 n)$ and $O(\log^{3+o(1)} n)$ time, respectively).
String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure
TLDR
This paper proposes the first algorithm that breaks the $O(n)$-time barrier for BWT construction, based on a novel concept of string synchronizing sets, which is of independent interest. It also shows that this technique yields a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension (LCE) queries in $O(1)$ time and can be deterministically constructed in the optimal $O(n/\log n)$ time.

References

Fast Compressed Self-indexes with Deterministic Linear-Time Construction
TLDR
A compressed suffix array representation that, on a text $T$ of length $n$ over an alphabet of size $\sigma$, can be built in $O(n)$ deterministic time, within working space, and counts the number of occurrences of any pattern $P$ in $T$ in time.
Time-Optimal Top-k Document Retrieval
TLDR
A data structure that uses linear space and reports the most relevant documents that contain a query pattern $P$. It supports an ample set of important relevance measures, such as the number of times $P$ appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of $P$ in a document.
Range Predecessor and Lempel-Ziv Parsing
TLDR
The rightmost variant of the Lempel-Ziv parsing of a string, where the goal is to associate with each phrase of the parsing its most recent occurrence in the input string, is considered, and a faster construction method for efficient 2D orthogonal range reporting is provided.
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
TLDR
The result presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
Locally Consistent Parsing for Text Indexing in Small Space
TLDR
It is shown how to use ideas based on the Locally Consistent Parsing technique, that was introduced by Sahinalp and Vishkin, in some non-trivial ways in order to improve the known results for the above problems.
Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time
We show that the compressed suffix array and the compressed suffix tree of a string $T$ can be built in $O(n)$ deterministic time using $O(n\log\sigma)$ bits of space, where $n$ is the string length
Alphabet-Dependent String Searching with Wexponential Search Trees
TLDR
One particular application of the above bounds (static and dynamic) is suffix trees, where the main technical contribution is a weighted variant of exponential search trees, which might be of independent interest.
Efficient Fully-Compressed Sequence Representations
TLDR
This work achieves compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; compressed permutations π with times for π() and π−1() improved to loglogarithmic; and the first compressed representation of dynamic collections of disjoint sets.
Small-Space LCE Data Structure with Constant-Time Queries
TLDR
A data structure of $O(z\tau^2 + \frac{n}{\tau})$ words of space which answers LCE queries in $O(1)$ time and can be built in $O(n\log\sigma)$ time, where $1 \leq \tau \leq \sqrt{n}$ is a parameter, $z$ is the size of the Lempel-Ziv 77 factorization of $w$, and $\sigma$ is the alphabet size.
Deterministic Indexing for Packed Strings
TLDR
A new string index is created in the deterministic and packed setting such that, given a packed pattern string of length $m$, queries are supported in (deterministic) time $O(m/\alpha + \log m + \log\log\sigma)$, where $\alpha = w/\log\sigma$ is the number of characters packed in a word of size $w = \log n$.