Video Rewrite: driving visual speech with audio

@article{Bregler1997VideoRD,
  title={Video Rewrite: driving visual speech with audio},
  author={Christoph Bregler and M. Covell and Malcolm Slaney},
  journal={Proceedings of the 24th annual conference on Computer graphics and interactive techniques},
  year={1997}
}
  • C. Bregler, M. Covell, M. Slaney
  • Published 3 August 1997
  • Computer Science
  • Proceedings of the 24th annual conference on Computer graphics and interactive techniques
Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors’ lip motions to the new soundtrack. Video Rewrite automatically labels the phonemes in the training data and in the new audio track, then reorders the mouth images in the training footage to match the phoneme sequence of the new… 
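
A minimal sketch of the reordering step, assuming a corpus of phoneme-labeled mouth-image clips; the TriphoneClip layout, the "sil" padding, and the context_cost measure are illustrative assumptions, not the authors' implementation (Video Rewrite matches triphones with a softer viseme-class distance):

from dataclasses import dataclass

@dataclass
class TriphoneClip:
    # A mouth-image sequence from the training footage, labeled with the
    # phoneme it shows plus its left/right neighbors (a triphone).
    left: str
    center: str
    right: str
    frames: list  # e.g. file paths or arrays for the mouth region

def context_cost(clip, want_left, want_right):
    # Toy mismatch cost: one point per mismatched neighbor. The paper
    # instead scores how visually similar the contexts are.
    return (clip.left != want_left) + (clip.right != want_right)

def select_clips(corpus, phonemes):
    # For each triphone of the new utterance, pick the stored clip with
    # the same center phoneme and the cheapest context mismatch.
    padded = ["sil"] + phonemes + ["sil"]
    chosen = []
    for i in range(1, len(padded) - 1):
        left, center, right = padded[i - 1], padded[i], padded[i + 1]
        candidates = [c for c in corpus if c.center == center]
        if not candidates:
            continue  # a real system would back off to a viseme class
        chosen.append(min(candidates, key=lambda c: context_cost(c, left, right)))
    return chosen

The selected clips would then be time-warped to the phoneme timing of the new audio and composited into the background footage, as the abstract describes.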

Video rewrite: visual speech synthesis from video

Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage, and is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.

Text-based editing of talking-head video

This work proposes a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts).

VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track

This paper builds on high-quality monocular capture of 3D facial performance, lighting and albedo of the dubbing and target actors, and uses audio analysis in combination with a space-time retrieval method to synthesize a new photo-realistically rendered and highly detailed 3D shape model of the mouth region to replace the target performance.

Speech-driven Face Reenactment for a Video Sequence

A system for reenacting a person’s face driven by speech, coined S2TH (speech to talking head), is presented; it requires no special hardware to capture the 3D geometry of faces, instead using a state-of-the-art method for facial geometry regression.

Rendering a personalized photo-real talking head from short video footage

This system can synthesize a highly photo-real talking head in sync with given speech signals (natural or TTS-synthesized) and won first place in the A/V consistency contest of the LIPS Challenge (2009), as perceptually evaluated by recruited human subjects.

Sample-based synthesis of photo-realistic talking heads

  • E. Cosatto, H. Graf
  • Computer Science
    Proceedings Computer Animation '98 (Cat. No.98EX169)
  • 1998
A system that generates photo-realistic video animations of talking heads from existing video footage, using precise multi-channel recognition techniques to track facial parts and derive the exact 3D position of the head, which enables the automatic extraction of normalized face parts.
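
As a rough illustration of the "normalized face parts" idea, this sketch computes a similarity transform that warps a frame so the detected eyes land at canonical positions; the landmark source, the 100x100 canvas, and the eye coordinates are assumptions for illustration, not values from the paper:

import numpy as np

def eye_similarity_transform(left_eye, right_eye,
                             out_left=(30.0, 40.0), out_right=(70.0, 40.0)):
    # 2x3 affine (rotation + scale + translation) mapping the detected eye
    # points to fixed positions in a 100x100 normalized face image.
    src = np.asarray([left_eye, right_eye], dtype=float)
    dst = np.asarray([out_left, out_right], dtype=float)
    d_src, d_dst = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(d_dst) / np.linalg.norm(d_src)
    angle = np.arctan2(d_dst[1], d_dst[0]) - np.arctan2(d_src[1], d_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    t = dst[0] - rot @ src[0]
    return np.hstack([rot, t[:, None]])

The returned matrix can be applied with, e.g., OpenCV's cv2.warpAffine; a fixed mouth crop in the normalized image then yields comparable mouth samples across frames.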

Joint Audio-Video Driven Facial Animation

The quality of the proposed system's facial animation generation surpasses that of recent state-of-the-art systems.

Fine-grained talking face generation with video reinterpretation

This work proposes a coarse-to-fine, tree-like architecture for synthesizing realistic talking-face frames directly from audio clips; it can generate fine-grained talking-face videos that are not only synchronized with the input audio but also maintain visual details from the input face images.

...

References

SHOWING 1-10 OF 28 REFERENCES

A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface

An automatic facial motion image synthesis scheme driven by speech and a real-time image synthesis design are presented to realize an intelligent human-machine interface or intelligent communication system with talking head images.

Automated lip-sync: Background and techniques

  • John Lewis
  • Computer Science
    Comput. Animat. Virtual Worlds
  • 1991
It is indicated that the automatic derivation of mouth movement from a speech soundtrack is a tractable problem; a common speech synthesis method, linear prediction, is adapted to provide simple and accurate phoneme recognition.
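
A toy version of that idea: compute linear-prediction coefficients per audio frame (standard Levinson-Durbin recursion) and match them against stored templates. The template set and the Euclidean matching below are placeholders, not the paper's actual procedure:

import numpy as np

def lpc(frame, order=10):
    # Linear-prediction coefficients via autocorrelation + Levinson-Durbin.
    # Assumes a float array; silent (all-zero) frames should be skipped.
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / e
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        e *= 1.0 - k * k
    return a[1:]

def classify_frame(frame, templates):
    # Nearest stored template in coefficient space; `templates` maps a
    # phoneme (or mouth-shape) name to a reference coefficient vector.
    coeffs = lpc(frame * np.hamming(len(frame)))
    return min(templates, key=lambda name: np.linalg.norm(coeffs - templates[name]))

Each classified frame then indexes a mouth pose, giving the frame-by-frame lip positions the summary refers to.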

Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion

This paper explores the use of local parametrized models of image motion for recovering and recognizing the non-rigid and articulated motion of human faces and shows how expressions can be recognized from the local parametric motions in the presence of significant head motion.

Performance-driven facial animation

A method for acquiring the expressions of real faces and applying them to computer-generated faces is described; this "electronic mask" offers a way for the traditional talents of actors to be flexibly incorporated into digital animations.

Synthesis of Speaker Facial Movement to Match Selected Speech Sequences

A system is described that synthesizes a video sequence of a realistic-appearing talking human head, using image processing rather than physical modeling techniques to create the video frames.

A real-time French text-to-speech system generating high-quality synthetic speech

The main features of the CNET diphone-based text-to-speech system for the French language are described; it provides notably improved sound quality and naturalness in comparison to commercially available systems.

A unified approach to coding and interpreting face images

A compact parametrised model of facial appearance is described which takes into account all sources of variability and can be used for tasks such as image coding, person identification, pose recovery, gender recognition and expression recognition.

Animating images with drawings

The work described here extends the power of 2D animation with a form of texture mapping conveniently controlled by line drawings, generalizing the prescriptive power of animated sequences and encouraging reuse of animated motion.

Nonlinear manifold learning for visual speech recognition

A system based on hidden Markov models and the learned lip manifold that significantly improves the performance of acoustic speech recognizers in degraded environments is described, and preliminary results on a purely visual lip reader are presented.

Computer generated animation of faces

  • F. Parke
  • Computer Science
    ACM Annual Conference
  • 1972
It was determined that approximating the surface of a face with a polygonal skin containing approximately 250 polygons defined by about 400 vertices is sufficient to achieve a realistic face.