State-of-the-art anonymization of medical records using an iterative machine learning framework.

Authors: György Szarvas, Richárd Farkas, Róbert Busa-Fekete
Journal: Journal of the American Medical Informatics Association
Objective: The anonymization of medical records is of great importance in the human life sciences because a de-identified text can be made publicly available for non-hospital researchers as well, to facilitate research on human diseases. Here the authors have developed a de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act. Design: We…

Anonymization of Sensitive Information in Medical Health Records
This paper attempts to identify PHI in medical records written in Spanish by building a neural network involving an LSTM-CRF model and applying two approaches for the anonymization of medical records.
Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives
This work presents a rule-based Natural Language Processing (NLP) anonymization system using a challenging corpus containing medical narratives and ICD-10 codes (medical codes) to identify, classify and anonymize Protected Health Information (PHI) with PHI categories.
Patient Data De-Identification: A Conditional Random-Field-Based Supervised Approach
Insight is provided into the de-identification task, its major challenges, techniques to address challenges, detailed analysis of the results and direction of future improvement, and a supervised machine learning technique for solving the problem of patient data deidentification.
This work improved the applicability of the NeuroNER system to Indian data and improved its efficiency and reliability.
The Role of Inference in the Anonymization of Medical Records
It is shown how sensitive attributes can be exploited to derive information about the QIs, leading to many privacy hazards for the patients whose records are shared.
Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification
An overview of this de-identification challenge is provided, the data and the annotation process are described, the evaluation metrics are explained, the nature of the systems that addressed the challenge is discussed, the results of received system runs are analyzed, and directions for future research are identified.
A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records
This paper proposes a hybrid semi-supervised learning approach to identifying protected health information (PHI) in electronic medical records that combines a machine learning-based method with a conditional random fields model and a rule-based method in a post-processing phase to handle 8 PHI types with disambiguity.
Automatic de-identification of medical records with a multilevel hybrid semi-supervised learning approach
  • N. Phuong, Vo Thi Ngoc Chau
  • Computer Science
    2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
  • 2016
This paper proposes an automatic de-identification solution in a multilevel hybrid semi-supervised learning paradigm with a key focus on correctly identifying protected health information (PHI) in the EMRs by combining a machine learning-based method with a conditional random fields model and a rule-based method in a post-processing phase to handle the PHI types with disambiguity.
De-identifying an EHR Database - Anonymity, Correctness and Readability of the Medical Record
A de-identification algorithm is developed that uses lists of named entities, simple language analysis, and special rules to generate a Danish EHR database with real medical records, but related to artificial persons.


Identification of patient name references within medical documents using semantic selectional restrictions
The proposed algorithm is based on estimating the fitness of candidate patient name references to a set of semantic selectional restrictions that place tight contextual requirements upon candidate words in the report text and are determined automatically from a manually tagged corpus of training reports.
Computer-assisted de-identification of free text in the MIMIC II database
An evaluation of methods for computer-assisted removal and replacement of protected health information (PHI) from free-text nursing notes collected in the intensive care unit as part of the MIMIC II project is presented.
Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation, and developed a method for tuning the balance of recall vs. precision in the Carafe system.
Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text
It is shown that one can deidentify medical discharge summaries using support vector machines that rely on a statistical representation of local context, which contributes more to deidentification than dictionaries and hand-tailored heuristics.
Medical document anonymization with a semantic lexicon
An original system for locating and removing personally-identifying information in patient records, using natural language processing tools provided by the MEDTAG framework: a semantic lexicon specialized in medicine, and a toolkit for word-sense and morpho-syntactic tagging.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
A successful technique for removing names in pathology reports using an augmented search and replace method
A tool that exploits the fact that the vast majority of proper names in pathology reports occur in pairs; it was easy to implement, relied largely on publicly available data sources, and achieved accuracy similar to previous attempts at de-identification.
Identifying Personal Health Information Using Support Vector Machines
This work explores the use of Support Vector Machines to recognize personal health information in medical discharge summaries by using an information extraction system designed for newswire text, plus a set of rules that incorporate entity-specific knowledge.
Replacing personally-identifying information in medical records, the Scrub system.
  • L. Sweeney
  • Computer Science
    Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium
  • 1996
We define a new approach to locating and replacing personally-identifying information in medical records that extends beyond straight search-and-replace procedures, and we provide techniques for
Automatic Deidentification by using Sentence Features and Label Consistency
The present paper proposes a new approach employing three types of non-local features, which do not come from surrounding words: sentence features, corresponding to the previous/next sentence information and label consistency, preferring the same label for the same word sequence.