String Sanitization Under Edit Distance: Improved and Generalized

@inproceedings{Mieno2021StringSU,
  title={String Sanitization Under Edit Distance: Improved and Generalized},
  author={Takuya Mieno and Solon P. Pissis and Leen Stougie and Michelle Sweering},
  booktitle={CPM},
  year={2021}
}
Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and… 
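
The first two conditions of the ETFS definition above, together with the edit-distance objective, can be illustrated with a small checker. This is only a sketch under stated assumptions and not the paper's algorithm: it verifies a candidate string rather than constructing an optimal one, it assumes unit-cost edits, and it assumes the candidate may use a separator letter `#` outside $\Sigma$ (length-$k$ substrings containing `#` are then not "over $\Sigma$" and are ignored for condition (ii)). The function names are hypothetical.

```python
def kmers(s, k):
    """All length-k substrings of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def nonsensitive_sequence(s, k, sensitive, sep="#"):
    """Length-k substrings over the alphabet (no separator) that are not sensitive."""
    return [u for u in kmers(s, k) if sep not in u and u not in sensitive]


def is_feasible(w, x, k, sensitive, sep="#"):
    """Check conditions (i) and (ii) for a candidate x (assumed separator: sep)."""
    # (i) no sensitive length-k string occurs in x
    no_sensitive = all(u not in sensitive for u in kmers(x, k))
    # (ii) the sequence of the other length-k substrings is the same in w and x
    same_order = (nonsensitive_sequence(w, k, sensitive, sep)
                  == nonsensitive_sequence(x, k, sensitive, sep))
    return no_sensitive and same_order


def edit_distance(a, b):
    """Unit-cost edit distance via the textbook quadratic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


# Toy instance (made up for illustration): W = "abab", k = 2, S = {"ba"}.
W, k, S = "abab", 2, {"ba"}
X = "ab#ab"
print(is_feasible(W, X, k, S), edit_distance(W, X))
```

Condition (iii) then asks for a feasible candidate that minimizes `edit_distance(W, X)`; the checker only evaluates that cost for a given candidate.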

Matching Patterns with Variables Under Edit Distance

TLDR
The problem of matching patterns with variables under edit distance is considered, and it is shown that the problem becomes intractable already for unary patterns, which consist of repeated occurrences of a single variable interleaved with terminals.

References

A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

TLDR
This work addresses the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights and presents an algorithm for comparing two run-length encoded strings of length $m$ and $n$, compressed into $m'$ and $n'$ runs, respectively, in $O(m'n + n'm)$ complexity.

All Highest Scoring Paths in Weighted Grid Graphs and Their Application to Finding All Approximate Repeats in Strings

TLDR
This work builds a data structure that supports $O(mn \log m)$ time queries about the weight of any of the $O(m^2 n)$ best paths from the vertices in column 0 of the graph to all other vertices, and presents a simple $O(n^2 \log n)$ time and $\Theta(n^2)$ space algorithm to find all approximate tandem repeats $xy$ within a string of size $n$.

Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees

A Succinct Four Russians Speedup for Edit Distance Computation and One-against-many Banded Alignment

TLDR
This work extends the classic result of Masek and Paterson which computes the edit distance between two strings in $O(m^2/\log m)$ time to remove the dependence on ψ even when edits have arbitrary costs from a penalty matrix and shows a new algorithm for the fundamental problem of one-against-many banded alignment.

Quadratic Conditional Lower Bounds for String Problems and Dynamic Time Warping

TLDR
A framework for proving quadratic-time hardness of similarity measures is introduced, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability, and conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.

Combinatorial Algorithms for String Sanitization

TLDR
A heuristic, MCSR-ALGO, is proposed, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented.

Approximate matching of regular expressions.

On the sorting-complexity of suffix tree construction

TLDR
A recursive technique for building suffix trees that yields optimal algorithms in different computational models that match the sorting lower bound and for an alphabet consisting of integers in a polynomial range the authors get the first known linear-time algorithm.

Hide and Mine in Strings: Hardness and Algorithms

TLDR
This work studies the fundamental relation between data sanitization and frequent pattern mining in the context of sequential data, and proposes integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under certain realistic assumptions on the problem parameters.