A Large Dataset of Historical Japanese Documents with Complex Layouts

Zejiang Shen, Kaixuan Zhang, Melissa Dell
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exist for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types… 

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

An Object-Level Active Learning framework for efficient document layout Annotation, OLALA, where only regions with the most ambiguous object predictions within an image are selected for annotators to label, optimizing the use of the annotation budget.

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks and incorporates a community platform for sharing both pre-trained models and full document digitization pipelines.

Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

This paper proposes a rule-based method that is evaluated and compared with two machine-learning models, namely RIPPER and Gradient Boosting, and shows that the rule-based system outperforms both models and provides higher recall.

DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer

A transformer-based model called DocSegTr is presented for end-to-end instance segmentation of complex layouts in document images. It adapts a twin attention module for semantic reasoning, which makes it highly computationally efficient compared with state-of-the-art approaches.

A Hybrid Information Extraction Approach using Transfer Learning on Richly-Structured Documents

A hybrid information extraction approach for documents with rich and complex structures is proposed, featuring a pipeline that uses OCR for plain textual information extraction and transfer learning for table detection.

Cross-Domain Document Layout Analysis via Unsupervised Document Style Guide

This paper integrated the document quality assessment and the document cross-domain analysis into a unified framework that is composed of three components, Document Layout Generator, Document Elements Decorator, and Document Style Discriminator.

Classification of handwritten annotations in mixed-media documents

An algorithm for generating a novel mixed-media document dataset, Annotated Docset, that consists of 14 classes of machine-printed and handwritten elements and annotations, and a novel loss function, Dense Loss, which can correctly identify small objects in complex documents when used in fully convolutional networks.

Deformable deep networks for instance segmentation of overlapping multi page handwritten documents

This work introduces a new document image dataset called IMMI (Indic Multi Manuscript Images), and proposes an approach which generates synthetic images to augment sourced non-synthetic images to aid deep network training.

Document Layout Analysis with Aesthetic-Guided Image Augmentation

Experimental results prove that the proposed image layer modeling method can better deal with the fine-grained segmented document of the non-Manhattan layout.

Knowledge Graph Embedding-Based Domain Adaptation for Musical Instrument Recognition

This article presents a new method for domain adaptation based on Knowledge graph embeddings that incorporates these semantic vector spaces as a key ingredient to guide the domain adaptation process.



PubLayNet: Largest Dataset Ever for Document Layout Analysis

The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central, and it is demonstrated that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles.

DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts

A publicly available historical manuscript database DIVA-HisDB is introduced for the evaluation of several Document Image Analysis (DIA) tasks and a layout analysis ground-truth which has been iterated on, reviewed, and refined by an expert in medieval studies is provided.

dhSegment: A Generic Deep-Learning Approach for Document Segmentation

This paper proposes an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks and shows that a single CNN-architecture can be used across tasks with competitive results.

READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents

This paper collects and annotates 2036 archival document images from different locations and time periods and proposes a new evaluation scheme based on baselines, which needs no binarization and can handle skewed as well as rotated text lines.

Multi-task Layout Analysis for Historical Handwritten Documents Using Fully Convolutional Networks

Experimental results on the public dataset DIVA-HisDB containing challenging medieval manuscripts demonstrate the effectiveness and superiority of the proposed multi-task layout analysis method.

DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images

In contrast to most existing table detection and structure recognition methods, which are applicable only to PDFs, DeepDeSRT processes document images, which makes it equally suitable for born-digital PDFs as well as even harder problems, e.g. scanned documents.

A Realistic Dataset for Performance Evaluation of Document Layout Analysis

This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents, with strong emphasis on comprehensive and detailed representation of both complex and simple layouts, and on colour originals.

ICDAR2019 Competition on Recognition of Documents with Complex Layouts - RDCL2019

An objective comparative evaluation of page segmentation and region classification methods for documents with complex layouts indicates that an innovative approach has a clear advantage but there is still a considerable need to develop robust methods that deal with layout challenges, especially with the non-textual content.

Evaluation of deep convolutional nets for document image classification and retrieval

This paper presents a new state of the art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs), and makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories.

The ENP image and ground truth dataset of historical newspapers

A baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) is given with regard to both text recognition and segmentation/layout analysis performance.