Audio-Visual Evaluation of Oratory Skills

  • Tzvi Michelson, Shmuel Peleg
  • Published 1 September 2021
  • 2021 Third International Conference on Transdisciplinary AI (TransAI)
What makes a talk successful? Is it the content or the presentation? We try to estimate the contribution of the speaker’s oratory skills to the talk’s success, while ignoring the content of the talk. By oratory skills we refer to facial expressions, motion and gestures, as well as vocal features. We use TED Talks as our dataset, and measure the success of each talk by its view count. Using this dataset we train a neural network to assess the oratory skills in a talk through three factors…
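The setup the abstract describes — predicting a talk's success (view count) from audiovisual features while ignoring content — can be sketched as follows. This is a minimal illustration on synthetic data: the feature groups, their dimensions, and the ridge-regression stand-in for the paper's neural network are all assumptions, not the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-talk feature vectors for the three modalities the
# abstract names: facial expressions, motion/gestures, vocal features.
n_talks = 200
face  = rng.normal(size=(n_talks, 8))
pose  = rng.normal(size=(n_talks, 6))
voice = rng.normal(size=(n_talks, 4))

X = np.hstack([face, pose, voice])          # (200, 18) combined features
true_w = rng.normal(size=X.shape[1])
# Synthetic target: log view count as a noisy linear function of features.
log_views = X @ true_w + 0.1 * rng.normal(size=n_talks)

# Ridge-regularised least squares as a linear stand-in for the network.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ log_views)
pred = X @ w

# Coefficient of determination on the training data.
r2 = 1 - np.sum((log_views - pred) ** 2) / np.sum((log_views - log_views.mean()) ** 2)
print(round(r2, 3))
```

A real system would of course extract such features from video and audio (e.g. face embeddings, pose trajectories, prosody statistics) rather than sample them, and would evaluate on held-out talks.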


A Peek at Peak Emotion Recognition
It is found that despite using very small datasets, features extracted from deep learning models can achieve results significantly better than humans in this task.


Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Online feedback system for public speakers
An online feedback system for public speakers is presented, in which emotion recognised from the speakers' body language is the primary component for analysis; a posture and gesture representation method based on Laban Movement Analysis is adopted.
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data.
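The dilated temporal convolutions this summary describes can be sketched in plain NumPy: each layer mixes keypoint features across frames spaced `dilation` apart, so stacking layers with growing dilation widens the temporal receptive field cheaply. The joint count, channel widths, and random weights below are illustrative, not the paper's configuration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid (no-padding) 1D convolution over time.

    x: (T, C_in) sequence of per-frame feature vectors
    w: (k, C_in, C_out) kernel with k taps spaced `dilation` frames apart
    returns: (T - (k - 1) * dilation, C_out)
    """
    k = w.shape[0]
    receptive = (k - 1) * dilation
    T_out = x.shape[0] - receptive
    out = np.zeros((T_out, w.shape[2]))
    for t in range(T_out):
        for j in range(k):
            out[t] += x[t + j * dilation] @ w[j]
    return out

T, J = 243, 17                                    # frames, joints (COCO-style)
rng = np.random.default_rng(1)
x = rng.normal(size=(T, J * 2))                   # flattened (x, y) per joint
h = dilated_conv1d(x, rng.normal(size=(3, J * 2, 64)), dilation=1)  # (241, 64)
h = dilated_conv1d(h, rng.normal(size=(3, 64, 64)), dilation=3)
print(h.shape)  # → (235, 64)
```

Two 3-tap layers with dilations 1 and 3 already cover 9 consecutive frames; the paper's full model stacks more such layers (with nonlinearities and residual connections) to cover hundreds of frames.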
ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
VGGFace2: A Dataset for Recognising Faces across Pose and Age
A new large-scale face dataset named VGGFace2 is introduced, which contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject, and the automated and manual filtering stages to ensure a high accuracy for the images of each identity are described.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
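The Adam update rule summarised above can be written out directly from its moment estimates. This toy run minimises a simple quadratic with the paper's default decay rates; it is a sketch of the update, not a reference implementation.

```python
import numpy as np

# Adam hyperparameters: step size and the paper's default decay rates.
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

w = np.array([5.0, -3.0])   # parameters of f(w) = ||w||^2
m = np.zeros_like(w)        # first-moment (mean of gradient) estimate
v = np.zeros_like(w)        # second-moment (uncentred variance) estimate

for t in range(1, 501):
    g = 2 * w                           # gradient of ||w||^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)        # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(np.abs(w).max())  # close to the minimiser w = 0
```

The per-coordinate division by `sqrt(v_hat)` is what makes the effective step size adaptive: coordinates with consistently large gradients take proportionally smaller steps.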
AutoManner: An Automated Interface for Making Public Speakers Aware of Their Mannerisms
An intelligent interface is presented that uses Microsoft Kinect to extract human gestures and make speakers aware of their mannerisms, applying a sparsity-based algorithm, Shift-Invariant Sparse Coding, to automatically extract patterns of body movement.
Presentation Trainer, your Public Speaking Multimodal Coach
The user experience evaluation of participants who used the Presentation Trainer to practice for an elevator pitch is presented, showing that the feedback provided by the Presentation Trainer has a significant influence on learning.
Rhema: A Real-Time In-Situ Intelligent Interface to Help People with Public Speaking
Rhema, an intelligent user interface for Google Glass that helps people with public speaking, automatically detects the speaker's volume and speaking rate in real time and provides feedback during the actual delivery of the speech.
Augmenting Social Interactions: Realtime Behavioural Feedback using Social Signal Processing Techniques
Logue, a system that provides realtime feedback on a presenter's openness, body energy and speech rate during public speaking, is presented; it analyses the user's nonverbal behaviour using social signal processing techniques and gives visual feedback on a head-mounted display.