• Corpus ID: 3103489

Distributional Semantics Resources for Biomedical Text Processing

@inproceedings{Pyysalo2013DistributionalSR,
  title={Distributional Semantics Resources for Biomedical Text Processing},
  author={Sampo Pyysalo and Filip Ginter and Hans Moen and Tapio Salakoski and Sophia Ananiadou},
  year={2013}
}
The openly available biomedical literature contains over 5 billion words in publication abstracts and full texts. Recent advances in unsupervised language processing methods have made it possible to make use of such large unannotated corpora for building statistical language models and inducing high quality vector space representations, which are, in turn, of utility in many tasks such as text classification, named entity recognition and query expansion. In this study, we introduce the first… 
PMCVec: Distributed phrase representation for biomedical text processing
TLDR
PMCVec is introduced, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously and produces significant performance gains both qualitatively and quantitatively.
Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute
TLDR
This paper presents a publicly available resource distributing preprocessed biomedical literature including sentence splitting, tokenization, part-of-speech tagging, syntactic parses and named entity recognition, covering the whole of PubMed and PubMed Central Open Access section.
Learning Effective Distributed Representation of Complex Biomedical Concepts
  • Khai Nguyen, R. Ichise
  • Computer Science
    2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE)
  • 2018
TLDR
This study used an efficient technique to index all possible concepts of the UMLS thesaurus in a huge corpus of 15.4 billion tokens, obtaining vector representations for more than 650,000 concepts, the largest such resource reported to date.
Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
TLDR
Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded.
Medical vocabulary mining using distributional semantics on Japanese patient blogs
TLDR
Evaluation of random indexing to extract medical terms from a Japanese blog corpus showed that similar settings are suitable for Japanese as for previously explored Germanic languages, and that distributional semantics is equally useful for semi-automatic expansion of Japanese medical vocabularies.
Learning Distributed Word Representations and Applications in Biomedical Natural Language Processing
A common challenge for biomedical natural language processing (BioNLP) is data sparsity. Distributed word representation approaches have been developed recently that represent words by learning from a…
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain
TLDR
This work proposes several approaches for sentence‐level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus.
Biomedical Named-Entity Recognition by Hierarchically Fusing BioBERT Representations and Deep Contextual-Level Word-Embedding
TLDR
This paper presents a novel method for biomedical named-entity recognition (BioNER) that hierarchically fuses representations from BioBERT, which is trained on biomedical corpora, with deep contextual-level word embeddings to handle the linguistic challenges within biomedical literature.
A Guide to Dictionary-Based Text Mining.
TLDR
This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking, and discusses a variety of approaches to these tasks.
Improving Biomedical Information Extraction with Word Embeddings Trained on Closed-Domain Corpora
TLDR
The behaviour of a B-NER deep learning architecture specifically devoted to Italian EHRs is analyzed, focusing on the contribution of different Word Embedding (WE) models used as the input text-representation layer, showing the substantial contribution of WEs trained on a closed-domain corpus formed exclusively of documents from the biomedical domain.

References

SHOWING 1-10 OF 24 REFERENCES
The textual characteristics of traditional and Open Access scientific journals are similar
TLDR
The assumption that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole is examined.
Synonym Extraction of Medical Terms from Clinical Text Using Combinations of Word Space Models
TLDR
It is demonstrated how synonyms of medical terms can be extracted automatically from a large corpus of clinical text using distributional semantics with Random Indexing and Random Permutation, effectively increasing the ability to identify synonymous relations between terms.
An improved corpus of disease mentions in PubMed citations
TLDR
A large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus, is created; its rich annotations make this disease-name corpus a valuable resource for mining disease-related information from biomedical text.
Open-domain Anatomical Entity Mention Detection
TLDR
The AnEM corpus is introduced, a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system, and demonstrates a promising level of performance.
New Tools for Web-Scale N-grams
TLDR
A new set of search tools are described that make use of part-of-speech tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale, which will allow novel sources of information to be applied to long-standing natural language challenges.
Anatomical entity mention recognition at literature scale
TLDR
AnatomyTagger is presented, a machine learning-based system for anatomical entity mention recognition that incorporates a broad array of approaches proposed to benefit tagging, including the use of Unified Medical Language System (UMLS)- and Open Biomedical Ontologies (OBO)-based lexical resources, word representations induced from unlabeled text, statistical truecasing and non-local features.
Developing a Robust Part-of-Speech Tagger for Biomedical Text
TLDR
Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus revealed that adding training data from a different domain does not hurt the performance of a tagger, and the authors' tagger exhibits very good precision on all these corpora.
Semantic Compositionality through Recursive Matrix-Vector Spaces
TLDR
A recursive neural network model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length and can learn the meaning of operators in propositional logic and natural language is introduced.
Improving Word Representations via Global Context and Multiple Word Prototypes
TLDR
A new neural network architecture is presented which learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and accounts for homonymy and polysemy by learning multiple embedDings per word.
Efficient Estimation of Word Representations in Vector Space
TLDR
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.