Data Set Used
Goals: Given a short YouTube video, output a natural-language sentence that describes the main activity in the video. When the model is not confident enough, it should produce a less specific answer, but it should not overgeneralize.
We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text…
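To make the triplet-selection idea concrete, here is a minimal sketch in Python. It is not the authors' model: it assumes a simple log-linear combination of per-word detector confidences with a text-mined prior over triplets, and the function name `select_triplet`, the `weight` and `floor` parameters, and all numbers are hypothetical, chosen only for illustration.

```python
import math
from itertools import product


def select_triplet(subj_scores, verb_scores, obj_scores, svo_prior,
                   weight=1.0, floor=1e-6):
    """Return the highest-scoring (subject, verb, object) triplet and its score.

    Assumption: score = log-confidence of the three detectors
    + weight * log-prior of the triplet. Detector scores must be > 0.
    Triplets missing from the prior receive the small `floor` probability.
    """
    best, best_score = None, -math.inf
    for s, v, o in product(subj_scores, verb_scores, obj_scores):
        vision = (math.log(subj_scores[s])
                  + math.log(verb_scores[v])
                  + math.log(obj_scores[o]))
        language = math.log(svo_prior.get((s, v, o), floor))
        score = vision + weight * language
        if score > best_score:
            best, best_score = (s, v, o), score
    return best, best_score


# Made-up detector confidences: the verb detector wrongly prefers "fly".
subjects = {"person": 0.9, "dog": 0.1}
verbs = {"ride": 0.4, "fly": 0.6}
objects = {"motorbike": 0.7, "car": 0.3}

# Made-up prior standing in for knowledge mined from text.
prior = {
    ("person", "ride", "motorbike"): 0.05,
    ("person", "ride", "car"): 0.02,
}

with_prior, _ = select_triplet(subjects, verbs, objects, prior)
vision_only, _ = select_triplet(subjects, verbs, objects, prior, weight=0.0)
print(with_prior)
print(vision_only)
```

With `weight=0.0` the prior is ignored and the noisy verb detector wins. With the prior enabled, the implausible triplet is penalized and a more sensible one is selected, which is the effect the abstract attributes to "real-world" knowledge.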