Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting

@inproceedings{Qiao2020TextPT,
  title={Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting},
  author={Liang Qiao and Sanli Tang and Zhanzhan Cheng and Yunlu Xu and Yi Niu and Shiliang Pu and Fei Wu},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2020}
}
Many approaches have recently been proposed to detect irregular scene text and achieved promising results. However, their localization results may not well satisfy the following text recognition part mainly because of two reasons: 1) recognizing arbitrary shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition will lead to suboptimal performances. To handle this incompatibility problem, in this paper we propose an… 


Boundary TextSpotter: Toward Arbitrary-Shaped Scene Text Spotting

This paper studies the problem of scene text spotting, which aims to detect and recognize text from cluttered images simultaneously and proposes an end-to-end trainable neural network named Boundary TextSpotter, which can easily deal with the text of arbitrary shapes.

SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

A new end-to-end scene text spotting framework termed SwinTextSpotter is proposed, using a transformer encoder with dynamic head as the detector and a novel Recognition Conversion mechanism to explicitly guide text localization through recognition loss.

MSA: an end-to-end scene text spotter with mask-supervised-attention

A novel end-to-end scene text recognition framework based on the Swin-Transformer FPN backbone network, which adopts the instance segmentation method to obtain the text mask and binarizes it to directly locate its polygon boundaries, to solve the problem of low fitting efficiency of the text sequence recognition module.

DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting

A novel Detection-agnostic End-to-End Recognizer, DEER, framework that reduces the tight dependency between detection and recognition modules by bridging them with a single reference point for each text instance, instead of using detected regions.

TextBlock: Towards Scene Text Spotting without Fine-grained Detection

A pioneering network termed TextBlock is developed, and a heuristic text block generation method as well as a multi-instance block-level recognition module are proposed, which are expected to have a potential impact on scene text spotting research in the future.

MANGO: A Mask Attention Guided One-Stage Scene Text Spotter

This paper proposes a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operation, and achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks.

ARTS: Eliminating Inconsistency between Text Detection and Recognition with Auto-Rectification Text Spotter

This work introduces and proves the existence of the inconsistency problem and proposes a differentiable Auto-Rectification Module (ARM) together with a new training strategy to enable propagating recognition loss back into detection branch, so that the authors' detection branch can be jointly optimized by detection and recognition targets.

Pointer Networks for Arbitrary-Shaped Text Spotting

This paper presents a highly efficient one-stage method named PointerNet for arbitrary-shaped text spotting, which does not rely on text detection and opens a novel spotting-by-character-detection paradigm. It also proposes a simple yet highly effective strategy named pointer that learns the 2D offset from the center of the current character to the center of the subsequent character.

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

DeepSolo is a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously, and introduces a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training.

Dynamic Low-Resolution Distillation for Cost-Efficient End-to-End Text Spotting

A novel cost-efficient Dynamic Low-resolution Distillation (DLD) text spotting framework, which aims to infer images in different small but recognizable resolutions and achieve a better balance between accuracy and efficiency.
...

References


Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

The recognition module of the Mask TextSpotter method is investigated separately, which significantly outperforms state-of-the-art methods on both regular and irregular text datasets for scene text recognition.

TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network

An end-to-end trainable network architecture, named TextNet, is proposed, which is able to simultaneously localize and recognize irregular text from images, and can achieve state-of-the-art performance on irregular datasets by a large margin.

Towards End-to-End Text Spotting with Convolutional Recurrent Neural Networks

A unified network that simultaneously localizes and recognizes text with a single forward pass is proposed, avoiding intermediate processes, such as image cropping, feature re-calculation, word separation, and character grouping.

FOTS: Fast Oriented Text Spotting with a Unified Network

This work proposes a unified end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information between the two complementary tasks, and introduces RoIRotate to share convolutional features between detection and recognition.

TextBoxes++: A Single-Shot Oriented Scene Text Detector

An end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass, and significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks.

TextField: Learning a Deep Direction Field for Irregular Scene Text Detection

A novel text detector named TextField, which outperforms the state-of-the-art methods by a large margin on two curved text datasets: Total-Text and SCUT-CTW1500, respectively; TextField also achieves very competitive performance on multi-oriented datasets: ICDAR 2015 and MSRA-TD500.

Self-Organized Text Detection with Minimal Post-processing via Border Learning

  • Yue Wu, P. Natarajan
  • Computer Science
    2017 IEEE International Conference on Computer Vision (ICCV)
  • 2017
The results of the extensive experiments show that the proposed solution achieves comparable performance, and often outperforms state-of-the-art approaches on standard benchmarks, even though the solution only requires minimal post-processing to parse a bounding box from a detected text map, while others often require heavy post-processing.

An End-to-End TextSpotter with Explicit Alignment and Attention

A novel text-alignment layer is proposed that allows it to precisely compute convolutional features of a text instance in arbitrary orientation, which is key to boosting the performance of the model on ICDAR 2015.

An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition

A novel neural network architecture is proposed that integrates feature extraction, sequence modeling and transcription into a unified framework, yielding an effective yet much smaller model that is more practical for real-world application scenarios.

ESIR: End-To-End Scene Text Recognition via Iterative Image Rectification

  • Fangneng Zhan, Shijian Lu
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
Extensive experiments show that the proposed ESIR is capable of rectifying scene text distortions accurately, achieving superior recognition performance for both normal scene text images and those suffering from perspective and curvature distortions.