Shared task on multimodal story generation at the Generation Challenges at INLG 2024

Overview

Introducing the Visually Grounded Story Generation (VGSG) Challenge, a shared task at the Generation Challenges at INLG 2024! Our goal is to generate coherent and visually grounded stories from multiple images using vision and language models.

Here is our call for participation.

Feel free to join the Discord server if you have any questions for the organizers.

More details can be found below.

Call for Participation

This task invites enthusiasts in vision-and-language integration to push the boundaries of vision-based story generation. The shared task is to generate stories grounded in a sequence of images. Participants will explore the intricate relations between visual prompts and textual stories, fostering innovation in techniques for coherent, visually grounded story generation. The task is particularly challenging because: 1) the output story should be a narratively coherent text with multiple sentences, and 2) the protagonists in the generated stories need to be grounded in the images. In addition, we also invite submissions that fine-tune existing multimodal models efficiently, a direction of interest to both industry and academia.
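
To make the image-sequence-to-story setup concrete, here is a minimal baseline sketch, assuming Hugging Face `transformers` pipelines: each image in the sequence is captioned with a pretrained image-captioning model, and the captions are stitched into a prompt for a text generation model. The model choices, prompt wording, and helper function are illustrative assumptions, not an official baseline of the shared task.

```python
# Minimal baseline sketch (illustrative assumption, not the official baseline):
# caption each image in the sequence, then continue the captions as a story.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
storyteller = pipeline("text-generation", model="gpt2")

def generate_story(image_paths):
    # Describe each image in the sequence independently.
    captions = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        captions.append(captioner(image)[0]["generated_text"])

    # Concatenate the per-image captions into a prompt and generate a continuation.
    prompt = "Scenes: " + " ".join(captions) + "\nStory:"
    output = storyteller(prompt, max_new_tokens=150, do_sample=True)[0]["generated_text"]
    return output[len(prompt):].strip()

print(generate_story(["img1.jpg", "img2.jpg", "img3.jpg"]))
```

A pipeline like this treats each image in isolation, so it tends to lose narrative coherence and character grounding across images; addressing exactly those weaknesses is the focus of the challenge.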

Industry has made major progress in vision-based text generation in the last few years. While generating descriptions from images or videos is now widely available, grounded story generation still lacks a standard evaluation benchmark. Investigating vision-based story generation requires several capabilities, including commonsense reasoning and content planning.

We're calling for submissions that demonstrate novel approaches to generating narratives that are not only coherent but also visually grounded. We particularly encourage participants to explore the following aspects of multimodal story generation: