String Indexing for Patterns with Wildcards

@article{Bille2013StringIF,
  title={String Indexing for Patterns with Wildcards},
  author={Philip Bille and Inge Li G{\o}rtz and Hjalte Wedel Vildh{\o}j and S{\o}ren Vind},
  journal={Theory of Computing Systems},
  year={2013},
  volume={55},
  pages={41-60}
}
We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results. A linear space index with query time O(m + σ^j log log n + occ). This significantly improves the previously best known linear space index by Lam et al. (in Proc. 18th ISAAC, pp. 846–857, 2007), which requires query time Θ(jn) in the worst case…
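To make the query in the abstract concrete, here is a minimal Python sketch of the problem statement only. It is a naive scan, not the index from the paper. The function name `wildcard_occurrences` and the use of `*` as the wildcard symbol are assumptions made for illustration; each wildcard is taken to match exactly one arbitrary character.

```python
def wildcard_occurrences(t, p, wildcard="*"):
    """Return all start positions in t where p matches.

    Assumption for illustration: `wildcard` stands for exactly one
    arbitrary character. The total pattern length is m + j in the
    abstract's notation (m characters plus j wildcards).
    """
    n, length = len(t), len(p)
    hits = []
    # Try every alignment of p against t; no preprocessing is done.
    for i in range(n - length + 1):
        if all(pc == wildcard or pc == t[i + k] for k, pc in enumerate(p)):
            hits.append(i)
    return hits


print(wildcard_occurrences("abracadabra", "a*a"))  # [3, 5]
```

This baseline inspects every alignment and so takes time proportional to n(m + j) per query. An index in the sense of the abstract preprocesses t once so that each query avoids rescanning the whole string.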
Gapped Indexing for Consecutive Occurrences
TLDR
A variant of string indexing is considered, where the goal is to compactly represent the string such that, given two patterns P1 and P2 and a gap range [α, β], one can quickly find the consecutive occurrences of P1 and P2 with distance in [α, β].
Space-Efficient String Indexing for Wildcard Pattern Matching
TLDR
These are the first non-trivial data structures for this problem that need $o(n\log n)$ bits of space.
Data Structure Lower Bounds for Document Indexing Problems
We study data structure problems related to document indexing and pattern matching queries and our main contribution is to show that the pointer machine model of computation can be extremely useful
Algorithms and Data Structures for Strings, Points and Integers: or, Points about Strings and Strings about Points
TLDR
This dissertation presents an O(n)-space data structure that supports fingerprint queries and is the first for general (unbalanced) SLPs to answer fingerprint queries without decompressing any text; it is also the first to dynamically maintain a string under a compression scheme that can achieve better than entropy compression.
String Indexing for Top-k Close Consecutive Occurrences
TLDR
Two new time-space trade-offs are given for the string indexing for top-$k$ close consecutive occurrences problem (SITCCO), including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
Error Tree: A Tree Structure for Hamming & Edit Distances & Wildcards Matching
Error Tree is a novel tree structure that is mainly oriented to solve the approximate pattern matching problems, Hamming and edit distances, as well as the wildcards matching problem. The input is a
Detecting Pattern Efficiently with Don't Cares
TLDR
This paper introduces an efficient simple method which can locate all occurrences of pattern P of k subpatterns with “don’t cares” of length m in text S of length n using a predefined computational method.
Frequency Based Indexing Technique for Pattern Matching
TLDR
The proposed indexing technique is an attempt to answer the queries based on the LIKE ‘%...%’ search without requiring full table scan which is shown through the empirical evaluation of the proposed scheme.
Matching and Compression of Strings with Automata and Word Packing
TLDR
This paper considers subsequence automata with default transitions, that is, special transitions that are taken only if none of the regular transitions match the current character and that do not consume the current character, and presents a novel hierarchical automata construction of independent interest.
String Indexing for Patterns with Wildcards
TLDR
This work considers the problem of indexing a string t to report the occurrences of a query pattern p containing m characters and j wildcards, and obtains an index with query time O(m+j+occ) using space $O(\sigma^{k^2} n \log^k \log n)$, where k is the maximum number of wildcards allowed in the pattern.

References

A linear size index for approximate pattern matching
TLDR
The feasibility of devising a linear-size index that still has a time complexity linear in m is investigated and an O(n)-space index is given that supports k-error matching in $O(m + occ + (\log n)^{k(k+1)} \log\log n)$ worst-case time.
Space Efficient Indexes for String Matching with Don't Cares
TLDR
The solution to the pattern-only case improves the matching time of the previous work tremendously in practice, and can be extended to handle optional wildcards, each of which can match zero or one character.
Dotted Suffix Trees: A Structure for Approximate Text Indexing
TLDR
This work addresses text indexing for approximate matching: a text undergoes preprocessing to generate an index, which can later be queried to identify the places where a string occurs with up to a certain number of errors k (edit distance).
Text indexing with errors
Indexing with Gaps
TLDR
This paper proposes a solution for k gaps, one with preprocessing time $O(nG^{2k} \log^k n \log\log n)$ and space of $O(m + 2^k \log\log n)$, where $m = \sum_{i=1} |p_i|$.
Finding Patterns with Variable Length Gaps or Don't Cares
TLDR
New algorithms to handle the pattern matching problem where the pattern can contain variable length gaps are presented and are shown to be useful in many other contexts.
Fast Algorithms for Finding Nearest Common Ancestors
TLDR
An algorithm for a random access machine with uniform cost measure (and a bound of $\Omega (\log n)$ on the number of bits per word) that requires $O(1)$ time per query and $O(n)$ preprocessing time is presented, assuming that the collection of trees is static.
Pattern Matching Algorithms with Don't Cares
TLDR
This paper presents algorithms for pattern matching where either the pattern P or the text T can contain “don’t care” characters; the algorithms solve the pattern matching problem in O(n + m + α) time, where α is the total number of occurrences of the component subpatterns.
Efficient string matching with wildcards and length constraints
TLDR
A complete algorithm, SAIL, is designed that returns each matching substring of P in T as soon as it appears in T, in O(n + klmg) time with O(lm) space overhead.
Succinct Text Indexing with Wildcards
TLDR
The first succinct index for a text that contains wildcards is presented, which doubles the size, yet it reduces the matching time to O(m log*** + m log d + occ), where m is the length of the query text.