Label embedding for text recognition

José A. Rodríguez-Serrano and Florent Perronnin
British Machine Vision Conference
The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields (CRF). This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in… 
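The retrieval formulation described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract is truncated here and does not specify the embedding functions, so the character-count label embedding (`embed_label`), the stand-in image embedding (`fake_image_embedding`), and the tiny lexicon are all hypothetical choices made for illustration. The only part taken from the text is the final step: given an image embedding, return the closest word label in a common Euclidean space.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def embed_label(word):
    """Toy label embedding (an assumption): L2-normalised character counts."""
    counts = Counter(word.lower())
    vec = [float(counts.get(ch, 0)) for ch in ALPHABET]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(image_embedding, lexicon):
    """Return the lexicon word whose embedding is closest to the image embedding."""
    return min(lexicon, key=lambda w: euclidean(image_embedding, embed_label(w)))


def fake_image_embedding(word, noise=0.05):
    """Hypothetical stand-in for a learned image embedding:
    the label embedding plus a small deterministic perturbation."""
    base = embed_label(word)
    return [v + noise * ((i % 3) - 1) for i, v in enumerate(base)]


lexicon = ["hotel", "motel", "house", "store"]
query = fake_image_embedding("motel")
result = recognize(query, lexicon)
print(result)
```

In a real system the image embedding would come from a learned projection of image features, and the label embedding would need to encode character order, since plain character counts cannot separate anagrams.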

Label Embedding: A Frugal Baseline for Text Recognition

The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.

Word Spotting and Recognition with Embedded Attributes

An approach in which both word images and text strings are embedded in a common vectorial subspace, allowing recognition and retrieval tasks to be cast as a nearest neighbor problem; the resulting representation is very fast to compute and, especially, to compare.

Supervised mid-level features for word image representation

  • Albert Gordo
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
This paper proposes to learn local mid-level features suitable for building word image representations by leveraging character bounding box annotations on a small set of training images, and achieves results comparable with or better than the state-of-the-art on matching and recognition tasks using global descriptors of only 96 dimensions.

Recognition and Retrieval in Natural Scene Images

This thesis proposes an iterative method, which alternates between finding the most likely solution and refining the interaction potentials, and presents two contrasting end-to-end recognition frameworks for scene text analysis on scene images.

Scene Text Recognition and Retrieval for Large Lexicons

This paper proposes an iterative method, which alternates between finding the most likely solution and refining the interaction potentials, and presents a conditional random field model defined on potential character locations and the interactions between them.


A convolutional neural network based architecture which incorporates a Conditional Random Field graphical model, taking the whole word image as a single input, which achieves state-of-the-art accuracy in lexicon-constrained scenarios, without being specifically modelled for constrained recognition.

Scene Text Recognition with Sliding Convolutional Character Models

The proposed scene text recognition method, which applies character models to convolutional feature maps, has a number of appealing properties: it is based on character models trained free of a lexicon, and it can recognize unknown words.

Understanding Text in Scene Images

This thesis proposes a robust text segmentation (binarization) technique and uses it to improve the recognition performance of scene text. It also presents an energy minimization framework that exploits both bottom-up and top-down cues for recognizing words extracted from street images.

Reading Text in the Wild with Convolutional Neural Networks

This work presents an end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval, and demonstrates a real-world application that makes thousands of hours of news footage instantly searchable via a text query.

LEWIS: Latent Embeddings for Word Images and Their Semantics

The goal of this work is to bring semantics into the tasks of text recognition and retrieval in natural images by proposing a convolutional neural network with a weighted ranking loss objective that ensures that the concepts relevant to the query image are ranked ahead of those that are not relevant.



End-to-end scene text recognition

While scene text recognition has generally been treated with highly domain-specific methods, the results demonstrate the suitability of applying generic computer vision methods.

Scene Text Recognition using Higher Order Language Priors

A framework is presented that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary, and achieves significant improvement in word recognition accuracies without using a restricted word list.

Large-Lexicon Attribute-Consistent Text Recognition in Natural Images

A new model for the task of word recognition in natural images that simultaneously models visual and lexicon consistency of words in a single probabilistic model is proposed and outperforms state-of-the-art methods for cropped word recognition.

Top-down and bottom-up cues for scene text recognition

This work presents a framework that exploits both bottom-up and top-down cues in the problem of recognizing text extracted from street images, and shows significant improvements in accuracies on two challenging public datasets, namely Street View Text and ICDAR 2003.

Word Spotting in the Wild

It is argued that the appearance of words in the wild spans a wide range of difficulties, and a new word recognition approach based on state-of-the-art methods from generic object recognition is proposed, in which object categories are considered to be the words themselves.

Real-time scene text localization and recognition

The proposed end-to-end real-time scene text localization and recognition method achieves state-of-the-art text localization results amongst published methods and it is the first one to report results for end-to-end text recognition.

Towards more effective distance functions for word image matching

It is shown that a weighted Euclidean distance can outperform DTW for matching word images, and the learnt distance functions can be extended to a new database to obtain accurate retrieval.

Word image matching using dynamic time warping

  • T. Rath, R. Manmatha
  • Computer Science
    2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.
  • 2003
This work presents an algorithm for matching handwritten words in noisy historical documents that performs better and is faster than competing matching techniques and presents experimental results on two different data sets from the George Washington collection.

Large-scale image retrieval with compressed Fisher vectors

This article shows why the Fisher representation is well-suited to the retrieval problem: it describes an image by what makes it different from other images. It also shows why the representation should be compressed to reduce its memory footprint and speed up retrieval.

Supervised semantic indexing

This article proposes Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match and proposes several improvements to the basic model, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH).