Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs

@inproceedings{Fernandes2022DescribingIF,
  title={Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs},
  author={Daniel Louzada Fernandes and Marcos Henrique Fonseca Ribeiro and F{\'a}bio Ribeiro Cerqueira and Michel Melo Silva},
  booktitle={VISIGRAPP},
  year={2022}
}
Several services for people with visual disabilities have emerged recently due to achievements in the areas of Assistive Technologies and Artificial Intelligence. Despite the growth in the availability of assistive systems, there is a lack of services that support specific tasks, such as understanding the image context presented in online content, e.g., webinars. Image captioning techniques and their variants are limited as Assistive Technologies, as they do not match the needs of visually impaired people…

References

Showing 1-10 of 33 references

A Hierarchical Approach for Generating Descriptive Image Paragraphs

A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Understanding Guided Image Captioning Performance across Domains

The human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity is a key factor for improved performance.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Contextualise, Attend, Modulate and Tell: Visual Storytelling

This work proposes a novel framework, Contextualize, Attend, Modulate and Tell (CAMT), that models the temporal relationships among the images in a sequence in both forward and backward directions, and evaluates the model on the Visual Storytelling Dataset using both automatic and human evaluation measures.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.

Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge

This work details the theory and engineering from the winning submission to the 2020 captioning competition, and provides a step towards improved assistive image captioning systems.

Dense Captioning with Joint Inference and Visual Context

A new model pipeline based on two novel ideas, joint inference and context fusion, is proposed, which achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.

Captioning Images Taken by People Who Are Blind

This work introduces the first image captioning dataset to represent this real use case, consisting of over 39,000 images taken by people who are blind, each paired with five captions, and analyzes modern image captioning algorithms to identify what makes this new dataset challenging for the vision community.