• Corpus ID: 15205659

Biomedical Semantic Indexing using Dense Word Vectors in BioASQ

Aris Kosmopoulos, Ion Androutsopoulos, Georgios Paliouras

Background: Biomedical curators are often required to semantically index large numbers of biomedical articles, using hierarchically related labels (e.g., MeSH headings). Large scale hierarchical classification, a branch of machine learning, can facilitate this procedure, but the resulting automatic classifiers are often inefficient because of the very large dimensionality of the dominant bag-of-words representation of texts. Feature selection quickly harms the accuracy of the classifiers in… 

Automated MeSH Indexing of Biomedical Literature Using Contextualized Word Representations

It is argued that current word embedding algorithms can be efficiently used to support the task of biomedical text classification and can be useful as a mechanism for validation and recommendation.

PMCVec: Distributed phrase representation for biomedical text processing

Semantic Classification and Indexing of Open Educational Resources with Word Embeddings and Ontologies

This paper proposes an approach that helps curators and instructors annotate educational content thematically by combining explicit knowledge graph representations with vector-based learning of formal thesaurus terms, and shows that it is possible to produce a reasonable set of thematic suggestions that exceed a certain similarity threshold.

A Big Data Approach for Health Data Information Retrieval

The proposed architecture was developed to improve a previous implementation by lowering the computation time, which allows the whole PubMed library to be used as the dataset and demonstrates the usability of this methodology in a real context.

Search and Graph Database Technologies for Biomedical Semantic Indexing: Experimental Analysis

This is the first work that combines search and graph database technologies for the task of biomedical semantic indexing, and the representation of the MeSH thesaurus as a graph database allows the use of graph search algorithms for accessing MeSH information to quickly and easily capture hierarchical relationships.

Improving Biomedical Information Extraction with Word Embeddings Trained on Closed-Domain Corpora

The behaviour of a B-NER DL architecture specifically devoted to Italian EHRs is analyzed, focusing on the contribution of different Word Embeddings (WEs) models used as input text representation layer, to show the substantial contribution of WEs trained on a closed domain corpus exclusively formed by documents belonging to the biomedical domain.

Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization

A number of optimized hyper-parameter sets and pre-trained word2vec and fastText models, available on https://github.com/dterg/bionlp-embed, are provided to optimize and compare these representations for the biomedical domain.

DeepMeSH: deep semantic representation for improving large-scale MeSH indexing

DeepMeSH is proposed, a method that incorporates deep semantic information for large-scale MeSH indexing and addresses the two challenges on both the citation and MeSH sides.

Some lessons learned using health data literature for smart information retrieval

It is shown how semantic similarity search for natural language texts can be leveraged in the biomedical domain by word embedding models obtained with the word2vec algorithm, exploiting a specifically developed Big Data architecture.

Can Embeddings Adequately Represent Medical Terminology? New Large-Scale Medical Term Similarity Datasets Have the Answer!

The results show that current embeddings are limited in their ability to adequately encode medical terms, and two novel datasets form a challenging new benchmark for the development of medical embeddings able to accurately represent the whole medical terminology.



Large-Scale Semantic Indexing of Biomedical Publications

The participation of the team in the large-scale biomedical semantic indexing task of BioASQ is documented.

Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

This paper systematically investigated three different types of word representation (WR) features for BNER, including clustering-based representation, distributional representation, and word embeddings, and showed that all the three WR algorithms were beneficial to machine learning-based BNER systems.

Substring selection for biomedical document classification

An algorithm is proposed that omits stemming and, instead, uses the most discriminative substrings as attributes in classification, which is particularly useful when labeled datasets are small.

BioASQ: A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering

The main results of the first two BioASQ challenges are presented, and the performance of information systems in supporting two tasks that are central to the biomedical question answering process is assessed.

NCBI at the 2014 BioASQ Challenge Task: Large-scale Biomedical Semantic Indexing and Question Answering

This paper reports participation in the 2014 BioASQ tasks on biomedical semantic indexing and question answering and builds on the previous learning-to-rank framework with a special focus on systematically incorporating results of complementary methods for improved performance.

An Incremental Approach for MEDLINE MeSH Indexing

Three approaches to automatic MeSH term suggestion are proposed, one building upon another in an incremental way: MetaMap-based labeling, which relies on the MetaMap tool to detect MeSH-related concepts for indexing; Search-based labeling, which builds on the MetaMap-based approach and further leverages information retrieval techniques for finding similar articles whose existing annotations are used for MeSH suggestion.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification

A broad overview is presented of several new modeling ideas that make text classification systems both more effective and more scalable, including reduction of inference time complexity for probabilistic classifiers using inverted indices, ensembles of diverse multi-label classifiers, and a novel feature-regression based method for scalable ensemble combination.

Support vector machines classification with a very large-scale taxonomy

The first evaluation of Support Vector Machines in web-page classification over the full taxonomy of the Yahoo! categories found that the hierarchical use of SVMs is efficient enough for very large-scale classification; however, in terms of effectiveness, the performance of SVMs over the Yahoo! Directory is still far from satisfactory, which indicates that more substantial investigation is needed.

Hierarchical document categorization with support vector machines

A novel hierarchical classification method that generalizes Support Vector Machine learning and that is based on discriminant functions that are structured in a way that mirrors the class hierarchy is proposed.