SpecTextor: End-to-End Attention-based Mechanism for Dense Text Generation in Sports Journalism

Abstract

Language-guided smart systems can help in designing next-generation human-machine interactive applications. Dense text description is a research area in which systems learn the semantic knowledge and visual features of each video frame and map them to descriptions of the video’s most relevant subjects and events. In this paper, we consider untrimmed sports videos as our case study. Generating dense descriptions in the sports domain to supplement journalistic work without relying on commentators and experts requires further investigation. Motivated by this, we propose an end-to-end automated text generator, SpecTextor, that learns semantic features from untrimmed videos of sports games and generates the associated descriptive text. The proposed approach treats a video as a sequence of frames and generates words sequentially. After splitting videos into frames, we use a pre-trained VGG-16 model to extract features from and encode the video frames. On top of these encoded frames, we propose a Long Short-Term Memory (LSTM) based attention-decoder pipeline that leverages a soft-attention mechanism to map the semantic features to relevant textual descriptions and thereby generate an explanation of the game. Because developing a comprehensive description of the game requires training on a set of dense time-stamped captions, we leverage two publicly available datasets: ActivityNet Captions and Microsoft Video Description. In addition, we use two decoding algorithms, beam search and greedy search, and compute two evaluation metrics, BLEU and METEOR.
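
As a rough illustration of the soft-attention step mentioned in the abstract, the sketch below computes attention weights over a handful of encoded frames and forms a weighted context vector for one decoding step. It is a minimal sketch, not the paper's implementation: the dot-product scoring, the function names `softmax` and `soft_attention`, and the toy two-dimensional features are all assumptions made for illustration, and the abstract does not specify how scores are actually computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(frame_features, decoder_state):
    """Return (weights, context) for a single decoding step.

    frame_features: list of per-frame feature vectors (stand-ins for
        encoder outputs such as VGG-16 features).
    decoder_state: current decoder hidden state; assumed here to have the
        same dimensionality as a frame feature vector.
    Scoring is a plain dot product, which is an assumption of this sketch.
    """
    # One relevance score per frame.
    scores = [
        sum(f * h for f, h in zip(frame, decoder_state))
        for frame in frame_features
    ]
    # Normalise scores into weights that sum to 1.
    weights = softmax(scores)
    # Context vector: attention-weighted average of the frame features.
    dim = len(frame_features[0])
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# Toy input: three "frames" with 2-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(frames, state)
```

In this toy input, the first and third frames align equally well with the decoder state, so they receive equal weight, while the second frame receives less. In a full decoder, the context vector would be fed to the LSTM together with the previously generated word.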

Publication
IEEE International Conference on Smart Computing (SMARTCOMP)