De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

@article{Dalianis2010DeidentifyingSC,
  title={De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields},
  author={Hercules Dalianis and Sumithra Velupillai},
  journal={Journal of Biomedical Semantics},
  year={2010},
  volume={1},
  pages={6}
}
Background: In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident. Results: We present work on the creation of two refined variants of a manually annotated Gold…
A Semi-supervised Approach for De-identification of Swedish Clinical Text
TLDR
A semi-supervised method is proposed for automatically creating high-quality training data; the results show that it can improve recall from 84.75% to 89.20% without sacrificing precision to the same extent, which drops from 95.73% to 94.20%.
De-identifying free text of Japanese electronic health records
TLDR
The LSTM-based machine learning method extracted the named entities to be de-identified with generally better performance than the authors' rule-based methods; however, machine learning methods are inadequate for processing expressions with low occurrence.
The OpenDeID corpus for patient de-identification
TLDR
The results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time.
Automatic Clinical Text De-Identification: Is It Worth It, and Could It Work for Me?
TLDR
This panel will focus on the issues related to the automatic de-identification of clinical text, including an overview of the domain, a demonstration of good examples of such applications in English and in Swedish with their main authors sharing development and adaptation experiences, and a discussion of HIPAA “Safe Harbor” de-identification quality and the risk of data being re-identified.
Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning
TLDR
The aim is to compare two machine learning algorithms, Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF), applied to a Swedish clinical data set annotated for de-identification; the results show that CRF performs better than deep learning with LSTM.
Bootstrapping a de-identification system for narrative patient records: Cost-performance tradeoffs
A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records
TLDR
This paper proposes a hybrid semi-supervised learning approach to identifying protected health information (PHI) in electronic medical records that combines a machine learning-based method using a conditional random fields model with a rule-based method in a post-processing phase to handle 8 PHI types with disambiguation.
Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish
TLDR
Four common rules for de-identification of personal names in EPRs written in Swedish are implemented and evaluated, and it is shown that, to obtain the highest recall and precision, the rules should be applied in the following order.
Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text
TLDR
It is concluded that it is possible to train transferable models based on pseudonymised Swedish clinical data, but even small narrative and distributional variation could negatively impact performance.

References

Annotating and Recognising Named Entities in Clinical Notes
TLDR
A new genre of text that is not well written, noise-prone, ungrammatical, and full of cryptic content is introduced: a mix of clinical progress notes drawn from an Intensive Care Service and clinical named entities.
Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification
TLDR
An overview of this de-identification challenge is provided, the data and the annotation process are described, the evaluation metrics are explained, the nature of the systems that addressed the challenge are discussed, the results of received system runs are analyzed, and directions for future research are identified.
Automated de-identification of free-text medical records
TLDR
An automated Perl-based de-identification software package is described that is generally usable on most free-text medical records (e.g., nursing notes, discharge summaries, X-ray reports), is sufficiently generalized, and can be customized to handle text files of any format.
Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research.
TLDR
By the end of the evaluation, the system was reliably and specifically removing safe-harbor identifiers and producing highly readable deidentified text without removing important clinical information.
The Stockholm EPR Corpus – Characteristics and Some Initial Findings
TLDR
The characteristics of the Stockholm Electronic Patient Record Corpus (the SEPR Corpus), an important resource for performing research on clinical data, are described; the corpus has characteristics that are very interesting from a linguistic point of view, such as domain-specific compounds and abbreviations, and various narratives.
Testing Tactics to Localize De-Identification
TLDR
A first gross de-identification step is performed in the hospital for new documents in a language different from English, here French patient reports, and two methods are tested: the first attempts to adapt an existing US de-Identifier for English, the second re-develops a new system which applies the same methods.
Identification of Entity References in Hospital Discharge Letters
TLDR
A system for automatic identification of named entities in Swedish clinical free text, in the form of discharge letters, by applying generic named entity recognition technology with minor adaptations is presented.
Towards a Methodology for Named Entities Annotation
TLDR
This work identifies the applications using named entity recognition, proposes to semantically define the elements to annotate, and puts forward a number of methodological recommendations to ensure a coherent and reliable annotation scheme.