A de-identifier for medical discharge summaries

Özlem Uzuner, Tawanda C. Sibanda, Yuan Luo, Peter Szolovits. Artificial Intelligence in Medicine, volume 42, issue 1.
Automatic de-identification of textual documents in the electronic health record: a review of recent research
A review of recent research in automated de-identification of narrative text documents from the electronic health record finds that dictionary-based methods perform better on PHI that is rarely mentioned in clinical text, but are more difficult to generalize.
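As a rough illustration of what a dictionary-based method looks like, the sketch below masks any word found in a small lookup set. The dictionary contents, the `[PHI]` placeholder, and the function name `mask_dictionary_terms` are assumptions made for this example; they are not taken from the reviewed systems.

```python
import re

# Hypothetical dictionary of known identifiers (assumption for illustration);
# real systems would rely on much larger name and location lists.
NAME_DICTIONARY = {"smith", "jones", "boston"}


def mask_dictionary_terms(text, dictionary=NAME_DICTIONARY, placeholder="[PHI]"):
    """Replace every alphabetic word found in the dictionary with a placeholder.

    Matching is case-insensitive; words not in the dictionary are left untouched.
    """
    def replace(match):
        word = match.group(0)
        return placeholder if word.lower() in dictionary else word

    return re.sub(r"[A-Za-z]+", replace, text)


print(mask_dictionary_terms("Mr. Smith was seen in Boston."))
```

A name that is absent from the dictionary passes through unchanged, which is one way to picture the generalization problem the review mentions.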
Improved de-identification of physician notes through integrative modeling of both public and private medical text
The results indicate that distributional differences between private and public medical text can be used to classify PHI accurately; a model is trained to recognize non-PHI words and phrases that appear in public medical texts.
De-identification of clinical narratives through writing complexity measures
De-identification of patient notes with recurrent neural networks
Introduces the first de-identification system based on artificial neural networks (ANNs); unlike existing systems, it requires no handcrafted features or rules, and it outperforms the state-of-the-art systems.
Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents
This study evaluates existing automated text de-identification methods and tools on Veterans Health Administration (VHA) clinical documents to assess which methods perform better for each category of PHI found in clinical notes and where new methods are needed to improve performance.
Named Entity Recognition in Unstructured Medical Text Documents
The NER toolkits of OpenNLP and spaCy are applied to identify and then remove or encode PII in IME reports prepared by physicians. Both platforms achieve high de-identification performance, and a spaCy model trained with a 70–30 train-test split performs best.
A de-identifier for electronic medical records based on a heterogeneous feature set
This thesis describes an extended and specialized Named Entity Recognizer (NER), a de-identifier, that detects instances of Protected Health Information in electronic medical records, and shows that the benefit of an inclusive feature set outweighs the harm from the very large dimensionality of the resulting classification problem.
Rule-based information extraction from patients' clinical data


Identification of patient name references within medical documents using semantic selectional restrictions
The proposed algorithm is based on estimating the fitness of candidate patient name references to a set of semantic selectional restrictions that place tight contextual requirements upon candidate words in the report text and are determined automatically from a manually tagged corpus of training reports.
Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification
An overview of this de-identification challenge is provided, the data and the annotation process are described, the evaluation metrics are explained, the nature of the systems that addressed the challenge is discussed, the results of received system runs are analyzed, and directions for future research are identified.
Was the Patient Cured? Understanding Semantic Categories and Their Relationships in Patient Records
CaRE combines the solutions to de-identification, semantic category recognition, assertion classification, and semantic relationship classification into a single application that facilitates the easy extraction of semantic information from medical text.
Computer-Assisted De-Identification of Free-text Nursing Notes
A semi-automated method was developed to allow clinicians to highlight PHI on the screen of a tablet PC and to compare and combine the selections of different experts reading the same notes, and expert adjudication demonstrated that inter-human variability was high.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
Concept-match medical data scrubbing. How pathology text can be used in research.
  • J. Berman
  • Medicine
    Archives of pathology & laboratory medicine
  • 2003
Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes, and this article addresses the problem of data scrubbing.
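One way to picture scrubbing by matching against a vocabulary is an allow-list filter: words that appear in a stop-word list or a medical vocabulary are kept, and everything else is blocked. The sketch below is a simplified illustration under that assumption, not the article's implementation; the word lists, the `***` blocking symbol, and the name `scrub_allowlist` are hypothetical.

```python
import re

# Tiny illustrative word lists (assumptions); a real scrubber would draw on a
# full medical nomenclature and a proper stop-word list.
STOP_WORDS = {"the", "of", "in", "a", "with", "and", "is", "was"}
MEDICAL_TERMS = {"adenocarcinoma", "colon", "biopsy"}


def scrub_allowlist(text, allowed=STOP_WORDS | MEDICAL_TERMS, blocked="***"):
    """Keep only words on the allow-list; replace every other word with `blocked`.

    Punctuation and whitespace are preserved because only alphabetic runs are matched.
    """
    def replace(match):
        word = match.group(0)
        return word if word.lower() in allowed else blocked

    return re.sub(r"[A-Za-z]+", replace, text)


print(scrub_allowlist("Biopsy of the colon with Dr. Jones"))
```

Because anything not explicitly allowed is removed, an unfamiliar name is blocked by default, at the cost of also blocking harmless words that are missing from the lists.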
Development and evaluation of an open source software tool for deidentification of pathology reports
There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with the open source, HIPAA compliant, deidentification tool.
The Unified Medical Language System.
The UMLS project and current developments in high-speed, high-capacity international networks are converging in ways that have great potential for enhancing access to biomedical information.
Replacing personally-identifying information in medical records, the Scrub system.
  • L. Sweeney
  • Computer Science
    Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium
  • 1996
We define a new approach to locating and replacing personally-identifying information in medical records that extends beyond straight search-and-replace procedures, and we provide techniques for
Research Paper: Fast Exact String Pattern-matching Algorithms Adapted to the Characteristics of the Medical Language
The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used, and considering the growing amount of text handled in the electronic patient record, it is worth implementing this efficient algorithm.
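For readers unfamiliar with fast exact matching, the sketch below implements Boyer-Moore-Horspool, one well-known algorithm of this kind. It is chosen here only as an example and is not necessarily the algorithm the paper recommends; the function name `horspool_search` is an assumption for illustration.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of `pattern` in `text`, or -1.

    Boyer-Moore-Horspool: after each comparison, the window is shifted by an
    amount determined by the text character aligned with the pattern's last
    position, which lets the search skip ahead by up to len(pattern).
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0

    # For every character except the last one in the pattern, store the
    # distance from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters that do not occur in the pattern allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1


note = "patient has hepatitis"
print(horspool_search(note, "hepatitis"))
```

The result agrees with Python's built-in `str.find`; the point of the sketch is the skip table, which is what lets the search avoid examining every position in the text.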