• Publications
  • Influence
An End-to-End Baseline for Video Captioning
TLDR
End-to-end training significantly improves over the traditional, disjoint training process and is evaluated on the Microsoft Research Video Description and MSR Video to Text benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
End-to-End Video Captioning
TLDR
End-to-end training significantly improves over the traditional, disjoint training process and is evaluated on the Microsoft Research Video Description and the MSR Video to Text benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
Exploiting Synchronized Lyrics And Vocal Features For Music Emotion Detection
TLDR
This work shows a comparison between text-based and audio-based deep learning classification models using different techniques from Natural Language Processing and Music Information Retrieval domains and concludes that using vocals only, instead of the whole audio data improves the overall performances of the audio classifier.