A framework for information extraction from tables in biomedical literature

Nikola Milosevic, Cassie Gregson, Robert Hernandez, G. Nenadic
International Journal on Document Analysis and Recognition (IJDAR)
The scientific literature is growing exponentially, and professionals are no longer able to cope with the current volume of publications. Text mining has in the past provided methods to retrieve and extract information from text; however, most of these approaches ignored tables and figures. Research on mining table data still lacks an integrated approach that considers all the complexities and challenges of a table. Our research is examining the methods for extracting…

Auto-CORPus: Automated and Consistent Outputs from Research Publications

An automated pipeline that cleans HTML files from biomedical literature using the Auto-CORPus package, together with a model developed to standardize section headers based on the Information Artifact Ontology.

Toward Automated Data Extraction: A Pilot Survey of the Structure of Tabular Data in Clinical Comparative Literature (Preprint)

The measurement context formats presented here, broadly classified into three classes that cover 92% of studies, form a basis for understanding the frequency of different reporting styles, supporting automated detection of data format for extraction of metrics.

Toward Automated Data Extraction According to Tabular Data Structure: Cross-sectional Pilot Survey of the Comparative Clinical Literature

The measurement context formats presented here, broadly classified into three classes that cover 92% of studies, form a basis for understanding the frequency of different reporting styles, supporting automated detection of the data format for extraction of metrics.

Opportunities and challenges of text mining in materials research

This review is directed at the broad class of researchers aiming to learn the fundamentals of text mining (TM) as applied to materials science publications.

Table extraction, analysis, and interpretation: the current state of the TabbyDOC project

This paper summarizes the TabbyDOC project’s results that are intended for the following tasks: automation of fine-tuning artificial neural networks for table detection in document images, a synthesis of programs for spreadsheet data transformation driven by user-defined rules of table analysis and interpretation, and generating RDF-triples from entities extracted from relational tables.

Assessment of Information Extraction Techniques, Models and Systems

It was observed that hybrid methods outperform the other methods because of their versatility in handling various document formats.

Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature

This work presents Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardisation and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics.

A Structure-Based Method for Building a Database of Extracted Figures from Scientific Documents: A Case Study of Iran Scientific Information Database (GANJ)

A structure-based method is proposed that extracts figures and their descriptions by analyzing the file layout; the results are saved in a database with a specific structure and indexed for retrieval in the search engine.

CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

CREGEX (Classifier Regular Expression) is a biomedical text classifier based on an automatically generated regular-expression feature space; it outperformed both the SVM and NB classifiers in accuracy and F-measure, and needed fewer training examples to achieve the same performance.



A Scalable Hybrid Approach for Extracting Head Components from Web Tables

A preprocessing method that determines the meaningfulness of a table to allow information extraction from tables on the Internet; it obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.

Table extraction for answer retrieval

To retrieve answers, the approach creates a cell document, which contains the cell and its metadata (headers, titles) for each table cell, and the retrieval model ranks the cells of the extracted tables using a language-modeling approach.
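The cell-document idea in this summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`build_cell_documents`, `rank_cells`), the choice of metadata (column header and table title only), and the Jelinek-Mercer smoothing with `lam=0.5` are all assumptions made for illustration.

```python
import math
from collections import Counter

def build_cell_documents(title, headers, rows):
    """One document per cell: the cell text plus its column header and table title."""
    docs = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            tokens = f"{cell} {headers[c]} {title}".lower().split()
            docs.append({"row": r, "col": c, "cell": cell, "tokens": tokens})
    return docs

def rank_cells(query, docs, lam=0.5):
    """Rank cell documents by query log-likelihood (assumed Jelinek-Mercer smoothing)."""
    collection = Counter(t for d in docs for t in d["tokens"])
    total = sum(collection.values())
    q_tokens = query.lower().split()
    scored = []
    for d in docs:
        tf = Counter(d["tokens"])
        n = len(d["tokens"])
        score = 0.0
        for q in q_tokens:
            p_doc = tf[q] / n if n else 0.0
            p_col = collection[q] / total if total else 0.0
            p = lam * p_doc + (1 - lam) * p_col
            # Small floor so query terms unseen in the whole collection do not give log(0).
            score += math.log(p if p > 0 else 1e-12)
        scored.append((score, d))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]

docs = build_cell_documents(
    "City populations",
    ["city", "population"],
    [["Paris", "2100000"], ["Rome", "2800000"]],
)
ranked = rank_cells("rome city", docs)
print(ranked[0]["cell"])
```

Folding the header and title into each cell document is what lets a query match a cell through its context as well as its own text.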

Converting and Annotating Quantitative Data Tables

New disambiguation strategies based on an ontology are introduced, which improve performance on "sloppy" datasets not yet targeted by existing systems.

Automating the extraction of data from HTML tables with unknown structure

Learning Table Extraction from Examples

A new approach to automated table extraction that exploits formatting cues in semi-structured HTML tables, learns lexical variants from training examples and uses a vector space model to deal with non-exact matches among labels is presented.
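As a rough illustration of how a vector space model can handle non-exact label matches, the sketch below compares labels by cosine similarity over bag-of-words vectors. The helper names, the whitespace tokenization, and the 0.5 threshold are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def vectorize(label):
    """Bag-of-words term-frequency vector for a label."""
    return Counter(label.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(label, known_labels, threshold=0.5):
    """Return the known label most similar to `label`, or None if below the threshold."""
    query = vectorize(label)
    scored = [(cosine(query, vectorize(k)), k) for k in known_labels]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None

known = ["Room price", "Price per night (USD)", "Location"]
print(best_match("Price per night", known))
print(best_match("Breakfast", known))
```

The threshold keeps labels with no shared terms from being forced onto the nearest known label.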

The Unified Medical Language System (UMLS): integrating biomedical terminology

The Unified Medical Language System is a repository of biomedical vocabularies developed by the US National Library of Medicine and includes tools for customizing the Metathesaurus (MetamorphoSys), for generating lexical variants of concept names (lvg) and for extracting UMLS concepts from text (MetaMap).

Extraction of Named Entities from Tables in Gene Mutation Literature

This work investigates the challenge of extracting information about genetic mutations from tables, and shows how classifying tabular information can be leveraged for the task of named entity detection for mutations.

A machine learning based approach for table detection on the web

A machine learning based approach that classifies each given table entity as either genuine or non-genuine; the authors also designed a novel web document table ground truthing protocol and used it to build a large table ground truth database.