Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text
@inproceedings{Sibanda2006RoleOL, title={Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text}, author={Tawanda C. Sibanda and {\"O}zlem Uzuner}, booktitle={NAACL}, year={2006} }
Deidentification of clinical records is a crucial step before these records can be distributed to non-hospital researchers. Most approaches to deidentification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They also can fail to remove PHI that is ambiguous between PHI and non-PHI. Named entity recognition (NER) technologies can be used for deidentification. Some of these technologies…
32 Citations
Automatic Deidentification by using Sentence Features and Label Consistency
- Computer Science
- 2006
The present paper proposes a new approach employing three types of non-local features, which do not come from surrounding words: sentence features, corresponding to the previous/next sentence information, and label consistency, preferring the same label for the same word sequence.
State-of-the-art anonymization of medical records using an iterative machine learning framework.
- Computer Science
- 2007
A de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act is developed.
An Iterative Method for the De-identification of Structured Medical Text
- Computer Science
- 2006
This work introduces a novel, iterative NER approach intended for use on semi-structured documents such as discharge records; it can successfully identify PHI in several steps.
Research Paper: State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework
- Computer Science
- J. Am. Medical Informatics Assoc.
- 2007
A de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act is developed.
A system for de-identifying medical message board text
- Computer Science
- 2010 Ninth International Conference on Machine Learning and Applications
- 2010
A system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges, significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.
Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records
- Computer Science
- J. Am. Medical Informatics Assoc.
- 2007
This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation, and a method for tuning the balance of recall vs. precision in the Carafe system.
Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research
- Computer Science
- J. Biomed. Informatics
- 2014
Feature Engineering for Domain Independent Named Entity Recognition and Biomedical Text Mining Applications
- Computer Science
- 2008
The aim was to demonstrate that task-specific feature engineering is beneficial to the overall performance and that for specific text mining tasks one can construct systems that are useful in practice and even compete with humans in processing textual data.
TEXT2TABLE: Medical Text Summarization System Based on Named Entity Recognition and Modality Identification
- Computer Science
- BioNLP@HLT-NAACL
- 2009
Experimental results demonstrate empirically that syntactic information can contribute to the method's accuracy, and an SVM-based classifier using syntactic information is proposed.
References
Identification of patient name references within medical documents using semantic selectional restrictions
- Computer Science
- AMIA
- 2002
The proposed algorithm is based on estimating the fitness of candidate patient name references to a set of semantic selectional restrictions that place tight contextual requirements upon candidate words in the report text and are determined automatically from a manually tagged corpus of training reports.
A successful technique for removing names in pathology reports using an augmented search and replace method
- Medicine
- AMIA
- 2002
A tool based on the fact that the vast majority of proper names in pathology reports occur in pairs; it was easy to implement, was largely based on publicly available data sources, and achieved accuracy similar to previous attempts at de-identification.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
- Medicine
- American journal of clinical pathology
- 2004
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
Computer-assisted de-identification of free text in the MIMIC II database
- Computer Science
- Computers in Cardiology, 2004
- 2004
An evaluation of methods for computer-assisted removal and replacement of protected health information (PHI) from free-text nursing notes collected in the intensive care unit as part of the MIMIC II project is presented.
Concept-match medical data scrubbing. How pathology text can be used in research.
- Medicine
- Archives of pathology & laboratory medicine
- 2003
Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes, and this article addresses the problem of data scrubbing.
Recognizing names in biomedical texts: a machine learning approach
- Computer Science
- Bioinform.
- 2004
The PowerBioNE system is the first system which deals with the cascaded entity name phenomenon, and the HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem.
Automatically Generating Extraction Patterns from Untagged Text
- Computer Science
- AAAI/IAAI, Vol. 2
- 1996
This work has developed a system called AutoSlog-TS that creates dictionaries of extraction patterns using only untagged text, and in experiments with the MUC-4 terrorism domain, created a dictionary of extraction patterns that performed comparably to a dictionary created by AutoSlog, using only preclassified texts as input.
Medical document anonymization with a semantic lexicon
- Computer Science
- AMIA
- 2000
An original system for locating and removing personally-identifying information in patient records, using natural language processing tools provided by the MEDTAG framework: a semantic lexicon specialized in medicine, and a toolkit for word-sense and morpho-syntactic tagging.
Protein Structures and Information Extraction from Biological Texts: The PASTA System
- Computer Science
- Bioinform.
- 2003
PASTA is the first information extraction (IE) system developed for the protein structure domain and one of the most thoroughly evaluated IE systems operating on biological scientific text to date.
An Algorithm that Learns What's in a Name
- Computer Science
- Machine Learning
- 2004
IdentiFinder™, a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities, is evaluated and is competitive with approaches based on handcrafted rules on mixed case text and superior on text where case information is not available.