Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification

@article{Uzuner2007ViewpointPE,
  title={Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification},
  author={{\"O}zlem Uzuner and Yuan Luo and Peter Szolovits},
  journal={Journal of the American Medical Informatics Association : JAMIA},
  year={2007},
  volume={14},
  number={5},
  pages={550--563}
}
To facilitate and survey studies in automatic de-identification, as part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, the authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that…

Large-scale evaluation of automated clinical note de-identification and its impact on information extraction
TLDR
NLP-based de-identification shows excellent performance that rivals the performance of human annotators and scales up to millions of documents quickly and inexpensively.
Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
TLDR
This paper describes a successful approach to de-identification, developed to participate in a recent AMIA-sponsored challenge evaluation, and a method for tuning the balance of recall vs. precision in the Carafe system.
The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge.
A review of Automatic end-to-end De-Identification: Is High Accuracy the Only Metric?
TLDR
It is argued that, despite the improvements in accuracy, challenges remain in surrogate generation and replacement of identified PHI, along with risks to patient protection and privacy.
BoB, a best-of-breed automated text de-identification system for VHA clinical documents
TLDR
The authors' system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
Automatic de-identification of textual documents in the electronic health record: a review of recent research
TLDR
A review of recent research in automated de-identification of narrative text documents from the electronic health record finds methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize.
Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results
TLDR
This paper summarizes the settings, data and results of the first shared track on anonymization of medical documents in Spanish, the MEDDOCAN (Medical Document Anonymization) track, which relied on a carefully constructed synthetic corpus of clinical case documents following annotation guidelines for sensitive data based on the analysis of the EU General Data Protection Regulation.
Viewpoint Paper: Repurposing the Clinical Record: Can an Existing Natural Language Processing System De-identify Clinical Notes?
TLDR
The authors tested the ability of MedLEE to remove protected health information (PHI) by comparing 100 outpatient clinical notes with the corresponding XML-tagged output, and found that PHI in the output was highly transformed, potentially making re-identification more difficult.
An evaluation of feature sets and sampling techniques for de-identification of medical records
TLDR
The results show that the context features (previous and next terms) are particularly important and the sampling technique can be used to increase recall with minimal impact on precision and the overall HIDE system achieves token-level precision.

References

Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
TLDR
This paper describes a successful approach to de-identification, developed to participate in a recent AMIA-sponsored challenge evaluation, and a method for tuning the balance of recall vs. precision in the Carafe system.
State-of-the-art anonymization of medical records using an iterative machine learning framework.
TLDR
A de-identification model is developed that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act (HIPAA).
Viewpoint Paper: Identifying Patient Smoking Status from Medical Discharge Records
TLDR
A Natural Language Processing (NLP) challenge on automatically determining the smoking status of patients from information found in their discharge records and analysis of the results highlighted the fact that discharge summaries express smoking status using a limited number of textual features.
Second i2b2 workshop on natural language processing challenges for clinical records.
  • Ozlem Uzuner
  • Medicine, Computer Science
    AMIA ... Annual Symposium proceedings. AMIA Symposium
  • 2008
TLDR
The obesity challenge is discussed, some approaches to automatically identifying obese patients and obesity co-morbidities from medical records are reviewed, and the challenge results are presented.
Computer-assisted de-identification of free text in the MIMIC II database
TLDR
An evaluation of methods for computer-assisted removal and replacement of protected health information (PHI) from free-text nursing notes collected in the intensive care unit as part of the MIMIC II project is presented.
An Iterative Method for the De-identification of Structured Medical Text
TLDR
This work introduces a novel, iterative NER approach intended for semi-structured documents such as discharge records, which can successfully identify PHI in several steps.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
TLDR
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
Overview of BioCreAtIvE: critical assessment of information extraction for biology
TLDR
The first BioCreAtIvE assessment provided state-of-the-art performance results for a basic task (gene name finding and normalization), where the best systems achieved a balanced 80% precision / recall or better, which potentially makes them suitable for real applications in biology.
Development and evaluation of an open source software tool for deidentification of pathology reports
TLDR
There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with the open source, HIPAA compliant, deidentification tool.
A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries
TLDR
It is concluded that with little implementation effort a simple regular expression algorithm for determining whether a finding or disease mentioned within narrative medical reports is present or absent can identify a large portion of the pertinent negatives from discharge summaries.