Corpus ID: 33157514

Anssi Nurminen: Algorithmic Extraction of Data in Tables in PDF Documents

@mastersthesis{Elomaa2013ANSSINA,
  title={Algorithmic Extraction of Data in Tables in PDF Documents},
  author={Anssi Nurminen},
  school={Tampere University of Technology},
  year={2013},
  month={April}
}
TAMPERE UNIVERSITY OF TECHNOLOGY
Degree Programme in Information Technology
NURMINEN, ANSSI: Algorithmic Extraction of Data in Tables in PDF Documents
Master of Science Thesis: 64 pages, 4 appendices (8 pages)
April 2013
Majoring in: Embedded systems (software emphasis)
Examiners: Prof. Tapio Elomaa, MSc. Teemu Heinimäki
TabbyPDF: Web-Based System for PDF Table Extraction
TLDR
This paper presents a novel web-based system for extracting tables located in untagged PDF documents with a complex layout, for recovering their cell structures, and for exporting them into a tagged form (e.g. in CSV or HTML format).
A PDF Wrapper System for Table Processing Purpose
Tables are a widely-used structure for data presentation and summarisation in documents but are not yet well-utilised computationally because of the difficulty of extracting their structure and data...
Tab.IAIS: Flexible Table Recognition and Semantic Interpretation System
TLDR
This paper develops two rule-based algorithms that perform the complete table recognition process and support the most frequent table formats found in the scientific literature and develops a graph-based table interpretation method for semantic information extraction.
On Graph-Based Verification for PDF Table Detection
Many non-editable documents are shared in PDF (Portable Document Format). They are typically not accompanied by tags for annotating the page layout, including table positions. One of the important...
A Cell-detection-based Table-structure Recognition Method
TLDR
A method that detects cells by estimating implicit ruled lines, where necessary, to recognize the table structure; its effectiveness is demonstrated by experiments using the ICDAR 2013 table competition dataset.
Epi Archive: Automated Synthesis of Global Notifiable Disease Data
TLDR
LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and make it uniform and readily accessible, and designed and wrote code to automate the downloading of the data for each country.
Deep Splitting and Merging for Table Structure Decomposition
TLDR
A pair of novel deep learning models (Split and Merge models) is presented that, given an input image, predict the basic table grid pattern and which grid elements should be merged to recover cells that span multiple rows or columns.
Table Detection for Improving Accessibility of Digital Documents using a Deep Learning Approach
TLDR
The results show that the proposed methodology can be used to reduce the uncertainty experienced by visually impaired people when listening to the contents of tables in digital documents through screen readers.
Acurio Machine Learning applied to improve accessibility of PDF documents for Visually Impaired Users
Digital documents are accessed by visually impaired people (VIP) through screen readers. Traditionally, digital documents were translated to braille text, but screen readers have proved to be...

References

SHOWING 1-10 OF 13 REFERENCES
pdf2table: A Method to Extract Table Information from PDF Files
TLDR
This work developed several heuristics which together recognize and decompose tables in PDF files and store the extracted data in a structured data format (XML) for easier reuse, and shows that purely heuristic-based approaches can achieve good results, especially for lucid tables.
PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents
  • Ermelinda Oro, M. Ruffolo
  • Computer Science
  • 2009 10th International Conference on Document Analysis and Recognition
  • 2009
TLDR
The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.
The interpretation of tables in texts
This thesis looks at the issues relating to the development of technology capable of processing tables as they appear in textual documents so that their contents may be accessed and further...
Metrics for evaluating performance in document analysis: application to tables
  • A. C. E. Silva
  • Computer Science
  • International Journal on Document Analysis and Recognition (IJDAR)
  • 2010
TLDR
A new pair of evaluation metrics is proposed that better suits the needs of document analysis; their application to several table tasks is shown, and a road-map for creating Hidden Markov Models for the task is drawn.
A methodology for evaluating algorithms for table understanding in PDF documents
TLDR
The evaluation takes into account three major tasks: table detection, table structure recognition and functional analysis and provides a general and flexible output model for each task along with corresponding evaluation metrics and methods.
Tabular Abstraction, Editing, and Formatting
This dissertation investigates the composition of high-quality tables with the use of electronic tools. A generic model is designed to support the different stages of tabular composition, including...
Towards a common evaluation strategy for table structure recognition algorithms
TLDR
This work describes its experiences in comparing its algorithm for table detection and structure recognition to another recently published system using a freely available dataset of 75 PDF documents and defines several classes of errors to ensure the repeatability of the results and their comparability between different systems from different research groups.
Information extraction as a stepping stone toward story understanding
TLDR
The thought of building a large-scale conceptual natural language processing (NLP) system that can understand open-ended text is daunting even to the most ardent enthusiasts.
A constraint-based approach to table structure derivation
  • Matthew F. Hurst
  • Computer Science
  • Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.
  • 2003
TLDR
An approach to deriving an abstract geometric model of a table from a physical representation, using a graph of constraints which must be satisfied in order to determine the relative horizontal and vertical position.
PDF-TREX dataset Retrieved 2013-02-24: http://staff.icar.cnr.it/ruffolo/files/PDF-TREX-Dataset.zip
  • 2013