Automatic de-identification of electronic medical records using token-level and character-level conditional random fields

@article{Liu2015AutomaticDO,
  title={Automatic de-identification of electronic medical records using token-level and character-level conditional random fields},
  author={Zengjian Liu and Yangxin Chen and Buzhou Tang and Xiaolong Wang and Qingcai Chen and Haodi Li and Jingfeng Wang and Qiwen Deng and Suisong Zhu},
  journal={Journal of biomedical informatics},
  year={2015},
  volume={58 Suppl},
  pages={S47-52}
}

Patient Data De-Identification: A Conditional Random-Field-Based Supervised Approach
TLDR
Insight is provided into the de-identification task, its major challenges, techniques to address challenges, detailed analysis of the results and direction of future improvement, and a supervised machine learning technique for solving the problem of patient data de-identification.
Patient Data De-Identification
TLDR
This paper proposes a supervised machine learning technique for solving the problem of patient data de-identification, based on the 2014 i2b2 (Informatics for Integrating Biology to the Bedside) de-identification challenge.
Automatic de-identification of medical records with a multilevel hybrid semi-supervised learning approach
  • N. Phuong, Vo Thi Ngoc Chau
  • Computer Science
    2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
  • 2016
TLDR
This paper proposes an automatic de-identification solution in a multilevel hybrid semi-supervised learning paradigm with a key focus on correctly identifying protected health information (PHI) in the EMRs by combining a machine learning-based method with a conditional random fields model and a rule-based method in a post-processing phase to handle the PHI types with ambiguity.
Building a Best-in-Class De-identification Tool for Electronic Medical Records Through Ensemble Learning
TLDR
An automated de-identification system that employs an ensemble architecture, incorporating attention-based deep learning models and rule-based methods supported by heuristics, to detect PHI in EHR data, and that transforms the detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier.
De-identification of patient notes with recurrent neural networks
TLDR
The first de-identification system based on artificial neural networks (ANNs) is introduced; unlike existing systems, it requires no handcrafted features or rules, and it outperforms the state-of-the-art systems.
A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records
TLDR
This paper proposes a hybrid semi-supervised learning approach to identifying protected health information (PHI) in electronic medical records that combines a machine learning-based method with a conditional random fields model and a rule-based method in a post-processing phase to handle 8 PHI types with ambiguity.
Survey on RNN and CRF models for de-identification of medical free text
TLDR
A comprehensive survey of work on automated free text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches finds that RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and also other systems, namely rule-based algorithms.
Is Multiclass Automatic Text De-Identification Worth the Effort?
TLDR
This study suggests that the development of more sophisticated classification of PHI may not be worth the effort in terms of both system accuracy and the usefulness of the output.
...

References

SHOWING 1-10 OF 39 REFERENCES
Large-scale evaluation of automated clinical note de-identification and its impact on information extraction
TLDR
NLP-based de-identification shows excellent performance that rivals the performance of human annotators and scales up to millions of documents quickly and inexpensively.
Automated de-identification of free-text medical records
TLDR
An automated Perl-based de-identification software package is described that is generally usable on most free-text medical records (e.g., nursing notes, discharge summaries, X-ray reports) and is sufficiently generalized that it can be customized to handle text files of any format.
Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification
TLDR
An overview of this de-identification challenge is provided, the data and the annotation process are described, the evaluation metrics are explained, the nature of the systems that addressed the challenge is discussed, the results of received system runs are analyzed, and directions for future research are identified.
State-of-the-art anonymization of medical records using an iterative machine learning framework.
TLDR
A de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act is developed.
Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents
TLDR
Evaluation of existing automated text de-identification methods and tools, as applied to Veterans Health Administration (VHA) clinical documents, to assess which methods perform better with each category of PHI found in clinical notes and when new methods are needed to improve performance.
Automatic de-identification of textual documents in the electronic health record: a review of recent research
TLDR
A review of recent research in automated de-identification of narrative text documents from the electronic health record finds methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize.
Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
TLDR
This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation, and developed a method for tuning the balance of recall vs. precision in the Carafe system.
Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features
TLDR
The results show that SSVMs are an algorithm with great potential for clinical NLP research, and both types of unsupervised word representation features are beneficial to clinical NER tasks.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
TLDR
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
A hybrid system for temporal information extraction from clinical text
TLDR
The TLink extraction module contains three individual classifiers for TLinks: between events and section times, within a sentence, and across different sentences, and the performance of the system was evaluated using scripts provided by the i2b2 organizers.
...