Corpus ID: 199448266

Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results

@inproceedings{Marimon2019AutomaticDO,
  title={Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results},
  author={Montserrat Marimon and Aitor Gonzalez-Agirre and Ander Intxaurrondo and Heidy Rodriguez and Jose Lopez Martin and Marta Villegas and Martin Krallinger},
  booktitle={IberLEF@SEPLN},
  year={2019}
}
There is increasing interest in exploiting the content of electronic health records by means of natural language processing and text-mining technologies, as they can yield resources that improve patient health and safety, aid clinical decision making, and facilitate drug repurposing or precision medicine. To share, redistribute, and make clinical narratives accessible for text-mining research, it is key to fulfill legal conditions and address restrictions related to data protection and…

Text Mining of Medical Documents in Spanish: Semantic Annotation and Detection of Recommendations
TLDR
This paper presents an approach to automatically label documents using appropriate medical terms by applying text mining techniques and exploiting semantic resources, and describes a technique that attempts to detect practice recommendations for doctors automatically in clinical guides.
De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports
TLDR
The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports and does not require a large training corpus; it can therefore be easily extended to other languages and medical texts, such as electronic health records.
De-identifying Spanish medical texts - named entity recognition applied to radiology reports
TLDR
The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports and does not require a large training corpus; it could therefore be easily extended to other languages and medical texts, such as electronic health records.
Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification
TLDR
An already existing Norwegian synthetic clinical corpus, NorSynthClinical, has been extended with PHIs and annotated by two annotators, obtaining an inter-annotator agreement of 0.94 F1-measure.
Closing the Gap: Joint De-Identification and Concept Extraction in the Clinical Domain
TLDR
A stacked model with restricted access to privacy-sensitive information and a multitask model are proposed for concept extraction on automatically anonymized data, and joint models for de-identification and concept extraction are investigated.
De-identification of Clinical Text for Secondary Use: Research Issues
TLDR
The challenges of re-using unstructured clinical data, in particular clinical text, are discussed, along with the impact of approaches based on named entity recognition and on replacing sensitive data with surrogates, and the lack of measures for usability and re-identification risk.
Anonymization of Sensitive Information in Medical Health Records
TLDR
This paper attempts to identify PHI in medical records written in Spanish by building a neural network based on an LSTM-CRF model and applying two approaches to the anonymization of medical records.
A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records
TLDR
A new de-identification data set in Italian has been developed and a stacked word representation form has been tested, based both on a contextualized linguistic model to manage word polysemy and its morpho-syntactic variations and on sub-word embeddings to better capture latent syntactic and semantic similarities.

References

De-identification of clinical notes in French: towards a protocol for reference corpus development
De-Identification of Clinical Free Text in Dutch with Limited Training Data: A Case Study
TLDR
A machine learning de-identification system for clinical free text in Dutch, relying on best practices from the state of the art in de-identification of English-language texts, is presented.
Viewpoint Paper: Evaluating the State-of-the-Art in Automatic De-identification
TLDR
An overview of this de-identification challenge is provided, the data and the annotation process are described, the evaluation metrics are explained, the nature of the systems that addressed the challenge is discussed, the results of received system runs are analyzed, and directions for future research are identified.
Preserving medical correctness, readability and consistency in de-identified health records
TLDR
A health record database contains structured data fields that identify the patient, such as patient ID, patient name, e-mail and phone number, which are fairly easy to de-identify, but also occur in fields with doctors’ free-text notes written in an abbreviated style that cannot be analyzed grammatically.
Finding Mentions of Abbreviations and Their Definitions in Spanish Clinical Cases: The BARR2 Shared Task Evaluation Results
TLDR
The overall aim of this effort was to evaluate strategies for automatically detecting mentions of abbreviations in running text, as well as returning their corresponding definitions given the context in Spanish clinical case studies.
Building a Spanish/Catalan health records corpus with very sparse protected information labelled
TLDR
This paper proposes an iterative method for building a corpus with labelled PHI from a large unlabelled corpus with a very sparse population of target PHI, and makes use of manually defined rules specified in the form of Augmented Transition Networks, thus minimizing the cost of manually annotating very sparse EHR corpora.
Anonymization of General Practioner Medical Records
TLDR
This work presents the requirements and goals of anonymization, and proposes methods including utilization of database structure, dictionaries, heuristics and natural language processing for anonymizing patient records in general, with a focus on general practitioner records gathered from a Profdoc Vision database.
Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm
TLDR
A novel track called "Technical interoperability and performance of annotation servers" was launched under the umbrella of the BioCreative text mining evaluation effort to enable the continuous assessment of technical aspects of text annotation web servers.
The Biomedical Abbreviation Recognition and Resolution (BARR) Track: Benchmarking, Evaluation and Importance of Abbreviation Recognition Systems Applied to Spanish Biomedical Abstracts
TLDR
The aim of the first Biomedical Abbreviation Recognition and Resolution (BARR) track, posed at the IberEval 2017 evaluation campaign, was to assess and promote the development of systems for generating a sense inventory of medical abbreviations.