Corpus ID: 232257709

No Intruder, no Validity: Evaluation Criteria for Privacy-Preserving Text Anonymization

Maximilian Mozes and Bennett Kleinberg
For sensitive text data to be shared among NLP researchers and practitioners, shared documents need to comply with data protection and privacy laws. There is hence a growing interest in automated approaches for text anonymization. However, measuring such methods’ performance is challenging: missing a single identifying attribute can reveal an individual’s identity. In this paper, we draw attention to this problem and argue that researchers and practitioners developing automated text… 


Challenges and Open Problems of Legal Document Anonymization
This paper aims to summarize and highlight the open and symmetrical problems from the fields of structured and unstructured text anonymization and the possible methods for anonymizing legal documents discussed and illustrated by case studies from the Hungarian legal practice.
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
A potential checklist for responsible data (re-)use is proposed that could both standardise the peer review of conference submissions and enable a more in-depth view of published research across the community.


Automated anonymization of text documents
Evaluation showed that the use of the tagging and the generalization methods facilitates the reading of an anonymized text while preventing some semantic drifts caused by the removal of the original information.
Anonymization of Unstructured Data via Named-Entity Recognition
This work proposes to use a named-entity recognition tagger based on machine learning to build a system capable of detecting all attributes that have privacy implications (identifiers, quasi-identifiers and sensitive attributes).
AnonyMate: A Toolkit for Anonymizing Unstructured Chat Data
The privacy protection toolkit AnonyMate is presented, which is built to anonymize both personal identifying information (PII) and corporate identifying information (CII) in human-computer dialogue text data.
Towards Personal Data Identification and Anonymization Using Machine Learning Techniques
The current approaches to identify personal data to anonymize are mainly based on text identification executed via regular expression scripts that are not dynamic enough to identify different formats of personal information.
Web-based text anonymization with Node.js: Introducing NETANOS (Named entity-based Text Anonymization for Open Science)
Netanos (Named Entity-based Text ANonymization for Open Science) is a natural language processing software that anonymizes texts by identifying and replacing named entities and provides three alternative anonymization types.
Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations
This paper presents a formal approach to carrying out privacy preserving text perturbation using the notion of d_χ-privacy designed to achieve geo-indistinguishability in location data.
Anonymization for the GDPR in the Context of Citizen and Customer Relationship Management and NLP
This work lists five functional requirements for an anonymization process, reports difficulties in implementing a solution that fully meets them, and proposes a practical compromise that currently satisfies users and could also be applied to other sectors such as the medical or financial ones.
Authorship Attribution for Forensic Investigation with Thousands of Authors
A novel authorship attribution model combining both profile-based and instance-based approaches to reduce the size of the candidate authors to a small number and narrow the scope of investigation with a high level of accuracy is proposed.
Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning
This work compares two machine learning algorithms, Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF), applied to a Swedish clinical data set annotated for de-identification, and shows that CRF performs better than deep learning with LSTM.
PharmacoNER Tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts
This work adapts an existing neural NER system, NeuroNER, to the particular domain of Spanish clinical case texts, and extends the neural network to be able to take into account additional features apart from the plain text.