Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text

@inproceedings{Sibanda2006RoleOL,
  title={Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text},
  author={Tawanda C. Sibanda and {\"O}zlem Uzuner},
  booktitle={NAACL},
  year={2006}
}
Deidentification of clinical records is a crucial step before these records can be distributed to non-hospital researchers. Most approaches to deidentification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They also can fail to remove PHI that is ambiguous between PHI and non-PHI. Named entity recognition (NER) technologies can be used for deidentification. Some of these technologies…
Automatic Deidentification by using Sentence Features and Label Consistency
TLDR
The present paper proposes a new approach employing three types of non-local features, which do not come from surrounding words: sentence features, corresponding to the previous/next sentence information and label consistency, preferring the same label for the same word sequence.
State-of-the-art anonymization of medical records using an iterative machine learning framework.
TLDR
A de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act is developed.
An Iterative Method for the De-identification of Structured Medical Text
TLDR
This work introduces a novel, iterative NER approach intended for semi-structured documents such as discharge records; it can successfully identify PHI in several steps.
Research Paper: State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework
TLDR
A de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act is developed.
A system for de-identifying medical message board text
TLDR
A system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges, significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.
Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
TLDR
This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation, and develops a method for tuning the balance of recall vs. precision in the Carafe system.
Feature Engineering for Domain Independent Named Entity Recognition and Biomedical Text Mining Applications
TLDR
The aim was to demonstrate that task-specific feature engineering is beneficial to the overall performance and that for specific text mining tasks one can construct systems that are useful in practice and even compete with humans in processing textual data.
TEXT2TABLE: Medical Text Summarization System Based on Named Entity Recognition and Modality Identification
TLDR
Experimental results demonstrate empirically that syntactic information can contribute to the method's accuracy and an SVM-based classifier using syntactic information is proposed.

References

Identification of patient name references within medical documents using semantic selectional restrictions
TLDR
The proposed algorithm is based on estimating the fitness of candidate patient name references to a set of semantic selectional restrictions that place tight contextual requirements upon candidate words in the report text and are determined automatically from a manually tagged corpus of training reports.
A successful technique for removing names in pathology reports using an augmented search and replace method
TLDR
A tool, based on the fact that the vast majority of proper names in pathology reports occur in pairs, that was easy to implement, relied largely on publicly available data sources, and achieved accuracy similar to previous attempts at de-identification.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
TLDR
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
Computer-assisted de-identification of free text in the MIMIC II database
TLDR
An evaluation of methods for computer-assisted removal and replacement of protected health information (PHI) from free-text nursing notes collected in the intensive care unit as part of the MIMIC II project is presented.
Concept-match medical data scrubbing. How pathology text can be used in research.
  • J. Berman
  • Medicine
    Archives of pathology & laboratory medicine
  • 2003
TLDR
Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes, and this article addresses the problem of data scrubbing.
Recognizing names in biomedical texts: a machine learning approach
TLDR
The PowerBioNE system is the first system which deals with the cascaded entity name phenomenon and the HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem.
Automatically Generating Extraction Patterns from Untagged Text
  • E. Riloff
  • Computer Science
    AAAI/IAAI, Vol. 2
  • 1996
TLDR
This work has developed a system called AutoSlog-TS that creates dictionaries of extraction patterns using only untagged text, and in experiments with the MUC-4 terrorism domain, created a dictionary of extraction patterns that performed comparably to a dictionary created by AutoSlog, using only preclassified texts as input.
Medical document anonymization with a semantic lexicon
TLDR
An original system for locating and removing personally-identifying information in patient records, using natural language processing tools provided by the MEDTAG framework: a semantic lexicon specialized in medicine, and a toolkit for word-sense and morpho-syntactic tagging.
Protein Structures and Information Extraction from Biological Texts: The PASTA System
TLDR
PASTA is the first information extraction (IE) system developed for the protein structure domain and one of the most thoroughly evaluated IE systems operating on biological scientific text to date.
An Algorithm that Learns What's in a Name
TLDR
IdentiFinder™, a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities, is evaluated and is competitive with approaches based on handcrafted rules on mixed case text and superior on text where case information is not available.