Post-OCR Paragraph Recognition by Graph Convolutional Networks

  title={Post-OCR Paragraph Recognition by Graph Convolutional Networks},
  author={Renshen Wang and Yasuhisa Fujii and Ashok Popat},
  journal={2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
We propose a new approach for paragraph recognition in document images by spatial graph convolutional networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a β-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With pure layout input features, the GCN model size is 3~4 orders of magnitude… 
1 Citations

Unified Line and Paragraph Detection by Graph Convolutional Networks

A graph convolutional network is used to predict the relations between text detection boxes and then build both levels of clusters from these predictions, demonstrating that the unified approach can be highly efficient while still achieving state-of-the-art quality for detecting paragraphs in public benchmarks and real-world images.

ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction

This work proposes Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to apprehend the sequential presentation of words in documents that consistently improves existing GCNs with a margin up to 8.4% F1-score.

Towards End-to-End Unified Scene Text Detection and Layout Analysis

A novel method is proposed that is able to simultaneously detect scene text and form text clusters in a unified way and achieves state-of-the-art results on multiple scene text detection datasets without the need of complex post-processing.

CaptchaGG: A linear graphical CAPTCHA recognition model based on CNN and RNN

CaptchaGG, a model for recognizing linear graphical CAPTCHAs, which has a simple architecture, extracting features by convolutional neural network, sequence modeling by recurrent neuralnetwork, and finally classification and recognition, can achieve an accuracy of 96% or more recognition at a lower complexity.



Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks

An end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images using a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text.

Table Detection in Invoice Documents by Graph Neural Networks

This work proposes a graph-based approach for detecting tables in document images that makes use of Graph Neural Networks (GNNs) in order to describe the local repetitive structural information of tables in invoice documents.

A Machine Learning Approach for Graph-Based Page Segmentation

This work proposes a new approach for segmenting a document image into its page components, based on simple machine learning models and graph-based techniques, which is easily adapted to the segmentation of a variety of document types.

Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection

This paper proposes a novel unified relational reasoning graph network for arbitrary shape text detection through an innovative local graph that bridges a text proposal model via Convolutional Neural Network and a deep relational reasoning network via Graphconvolutional Network, making the network end-to-end trainable.

DeepLayout: A Semantic Segmentation Approach to Page Layout Analysis

This paper introduces semantic segmentation which is an end-to-end trainable deep neural network which takes only document image as input and predicts per pixel saliency maps and successfully brings RLSA into post-processing procedures to specify the boundaries.

R2 CNN: Rotational Region CNN for Arbitrarily-Oriented Scene Text Detection

A Rotational Region CNN (R2CNN) is designed, which includes a Text Region Proposal Network (Text-RPN) to estimate approximate text regions and a multitask refinement network to get the precise inclined box.

Deep Visual Template-Free Form Parsing

This work presents a learned, template-free solution to detecting pre-printed text and input text/handwriting and predicting pair-wise relationships between them, and shows that the proposed pairing method outperforms heuristic rules and that visual features are critical to obtaining high accuracy.

PubLayNet: Largest Dataset Ever for Document Layout Analysis

The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central and demonstrated that deep neural networks trained on Pub LayNet accurately recognize the layout of scientific articles.

Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning

It is shown that the graph convolution of the GCN model is actually a special form of Laplacian smoothing, which is the key reason why GCNs work, but it also brings potential concerns of over-smoothing with many convolutional layers.

Graph Attention Networks

We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior