Evaluating distributed word representations for capturing semantics of biomedical concepts

@inproceedings{Th2015EvaluatingDW,
  title={Evaluating distributed word representations for capturing semantics of biomedical concepts},
  author={Muneeb Th and Sunil Kumar Sahu and Ashish Anand},
  booktitle={BioNLP@IJCNLP},
  year={2015}
}
Recently there has been a surge of interest in learning vector representations of words from huge corpora in an unsupervised manner. Such word vector representations, also known as word embeddings, have been shown to improve the performance of machine learning models in several NLP tasks. However, the efficacy of such representations has not been systematically evaluated in the biomedical domain. In this work, our aim is to compare the performance of two state-of-the-art word embedding methods, namely word2vec…
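Evaluations of this kind typically score embeddings by comparing cosine similarities of term pairs against expert similarity ratings, using Spearman rank correlation. The sketch below illustrates that pipeline with numpy only; the embeddings, term pairs, and ratings are toy values invented for illustration, not data from the paper.

```python
import numpy as np

# Hypothetical toy embeddings and expert ratings (illustrative only).
emb = {
    "aspirin":   np.array([0.9, 0.1, 0.3]),
    "ibuprofen": np.array([0.8, 0.2, 0.35]),
    "fracture":  np.array([0.1, 0.9, 0.2]),
    "bone":      np.array([0.15, 0.85, 0.3]),
}
pairs = [("aspirin", "ibuprofen"), ("aspirin", "fracture"), ("fracture", "bone")]
human = [3.8, 1.2, 3.5]  # hypothetical expert similarity ratings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    # Rank-based correlation via double argsort; ties are not handled,
    # which is fine for this toy data.
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model, human)
```

A higher rho means the embedding space orders term pairs more like the experts do; the paper applies this style of intrinsic evaluation to compare word2vec and GloVe vectors.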
Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain
A comprehensive study of the interpretability of word embeddings in the medical domain, focusing on the role of sparse methods, finds that sparse word vectors are far more interpretable while preserving the downstream-task performance of their original vectors.
Knowledge-Base Enriched Word Embeddings for Biomedical Domain
A new word embedding model for the biomedical domain is proposed that jointly leverages information from available corpora and domain knowledge in order to generate knowledge-base-powered embeddings.
Feature Importance for Biomedical Named Entity Recognition
This paper surveys the features used in BioNLP and evaluates each feature's utility in a sample BioNLP task, the N2C2 2018 named entity recognition challenge, finding that fastText word embeddings yield a significantly higher F1 score than any other individual feature.
Medical Word Embeddings for Spanish: Development and Evaluation
Evaluation datasets are adapted to Spanish for the first time for medical NLP, and in-domain medical word embeddings for Spanish are created using the state-of-the-art fastText model. Experiments show that the embeddings are suitable for medical NLP in Spanish and are more accurate than general-domain ones.
Integrating extra knowledge into word embedding models for biomedical NLP tasks
The main idea is to construct a weighted graph from knowledge bases (KBs) to represent structured relationships among words and concepts, and to propose GCBOW and GSkip-gram models by integrating such a graph into the original CBOW and Skip-gram models via graph regularization.
Augmenting word embeddings through external knowledge-base for biomedical application
A 13% improvement in correlation with experts, shown in experiments on biomedical concept similarity and relatedness tasks, validates the effectiveness of the proposed approach and demonstrates the importance of incorporating human-curated knowledge when generating word embeddings.
On Using Composite Word Embeddings To Improve Biomedical Term Similarity
This paper proposes a novel contextual embedding for a "wide sentential context" and generates a composite word embedding that achieves a multi-scale word representation, showing that the composite embedding outperforms current individual state-of-the-art techniques on both intrinsic and extrinsic evaluations.
Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine
Bio-SimVerb and Bio-SimLex enable intrinsic evaluation of word representations and highlight the importance of developing dedicated evaluation resources for biomedical NLP for particular word classes (e.g. verbs).
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain
This work proposes several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus.
Vector representations of multi-word terms for semantic relatedness
Baseline co-occurrence vectors are compared against dimensionality-reduced vectors created using singular value decomposition (SVD) and against word2vec embeddings trained with the continuous bag-of-words and skip-gram models, in order to find optimal vector dimensionalities for each technique.
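The SVD baseline described above can be sketched in a few lines of numpy: build a term-by-context count matrix, keep the top-k singular directions as dense term vectors, and compare terms by cosine similarity. The counts below are hypothetical, chosen only so that the first two terms share contexts while the third does not.

```python
import numpy as np

# Toy term-by-context co-occurrence counts (hypothetical); rows are terms.
X = np.array([
    [4., 1., 0., 2.],
    [3., 2., 0., 1.],
    [0., 1., 5., 0.],
])

# Truncated SVD: keep the top-k singular directions as dense term vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # reduced representation of each term

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms 0 and 1 share contexts, so their reduced vectors should be
# more similar than those of terms 0 and 2.
sim01 = cosine(term_vecs[0], term_vecs[1])
sim02 = cosine(term_vecs[0], term_vecs[2])
```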

References

Showing 1-10 of 24 references
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
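The GloVe objective can be written as a weighted least-squares fit of word-context dot products to log co-occurrence counts. A minimal numpy sketch of the cost function follows; the vocabulary, counts, and vectors are random toy values, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                                          # toy vocabulary size, vector dim
X = rng.integers(1, 20, size=(V, V)).astype(float)   # hypothetical co-occurrence counts

W  = 0.1 * rng.standard_normal((V, d))   # word vectors
Wc = 0.1 * rng.standard_normal((V, d))   # context vectors
b  = np.zeros(V)                         # word biases
bc = np.zeros(V)                         # context biases

def glove_cost(W, Wc, b, bc, X, x_max=100.0, alpha=0.75):
    # Weighting f(x) down-weights rare pairs and caps frequent ones.
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)
    # Squared error between w_i . w~_j + b_i + b~_j and log X_ij.
    err = W @ Wc.T + b[:, None] + bc[None, :] - np.log(X)
    return float(np.sum(f * err ** 2))

cost = glove_cost(W, Wc, b, bc, X)
```

Training minimizes this cost with stochastic gradient updates over the nonzero entries of X; the matrix-factorization flavor of the objective is what gives GloVe its "global" character.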
Measures of semantic similarity and relatedness in the biomedical domain
There is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large datasets are proposed, and these vectors are shown to provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation
The ranking and retrieval results generated by word2vec were found to be of insufficient quality for automatic population of knowledge bases and ontologies, but could serve as a starting point for further manual curation.
Semantic Compositionality through Recursive Matrix-Vector Spaces
A recursive neural network model is introduced that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length, and that can learn the meaning of operators in propositional logic and natural language.
Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features
A machine-learning-based approach extracts mentions of adverse drug reactions (ADRs) from highly informal text in social media. It is well suited to social media mining, as it relies on large volumes of unlabeled data, diminishing the need for large annotated training sets.
Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study.
The results of the study confirm the existence of a measurable mental representation of semantic relatedness between medical terms that is distinct from similarity and independent of the context in which the terms occur.
Distributed Representations of Words and Phrases and their Compositionality
This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
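For one (center, context) pair, negative sampling replaces the full softmax with a binary objective: push the true context word's score up and the scores of a few sampled "negative" words down. A numpy sketch of that per-pair loss follows; the vectors and negatives are random toy values, not trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
v_center  = 0.1 * rng.standard_normal(d)       # input vector of the center word
u_context = 0.1 * rng.standard_normal(d)       # output vector of the true context word
U_neg     = 0.1 * rng.standard_normal((5, d))  # output vectors of 5 sampled negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# SGNS loss for one pair:
#   -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -(np.log(sigmoid(u_context @ v_center))
         + np.sum(np.log(sigmoid(-U_neg @ v_center))))
```

Minimizing this loss over a corpus, with negatives drawn from a smoothed unigram distribution, yields the skip-gram embeddings evaluated throughout this page.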
A Neural Probabilistic Language Model
This work proposes to fight the curse of dimensionality by learning a distributed representation for words that allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
Community Evaluation and Exchange of Word Vectors at wordvectors.org
A website and suite of offline tools are presented that facilitate evaluation of word vectors on standard lexical semantics benchmarks and permit exchange and archival by users who wish to find good vectors for their applications.