MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

@article{Lei2020MARTMR,
  title={MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning},
  author={Jie Lei and Liwei Wang and Yelong Shen and Dong Yu and Tamara L. Berg and Mohit Bansal},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.05402}
}
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence…
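The recurrent memory idea described in the abstract — a memory state summarizing past video segments and sentences, carried forward across decoding steps — can be illustrated with a minimal numpy sketch. This is a hypothetical simplification (single memory-attention step plus a sigmoid gate), not the paper's exact update equations; all weight names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def update_memory(memory, hidden, Wq, Wk, Wv, Wz, bz):
    """One recurrent memory update (illustrative, not MART's exact form):
    memory slots attend over the current segment's hidden states, then a
    sigmoid gate blends the attended summary into the previous memory."""
    q = memory @ Wq                                   # (slots, d) queries from memory
    k = hidden @ Wk                                   # (tokens, d) keys from segment
    v = hidden @ Wv                                   # (tokens, d) values from segment
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (slots, tokens)
    summary = attn @ v                                # (slots, d) segment summary
    gate_in = np.concatenate([memory, summary], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ Wz + bz))) # sigmoid gate per slot/dim
    return gate * memory + (1.0 - gate) * summary     # gated blend -> new memory

# Toy dimensions: 2 memory slots, 5 segment tokens, hidden size 8.
rng = np.random.default_rng(0)
d, slots, tokens = 8, 2, 5
mem = rng.standard_normal((slots, d))
hid = rng.standard_normal((tokens, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wz = rng.standard_normal((2 * d, d)) * 0.1
bz = np.zeros(d)

new_mem = update_memory(mem, hid, Wq, Wk, Wv, Wz, bz)
print(new_mem.shape)  # (2, 8)
```

At generation time the updated memory would be fed back in when decoding the next sentence, which is what lets later sentences stay coherent with earlier ones.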
