Microsoft AI proposes an automated pipeline that leverages GPT-4V(ision) to generate accurate audio descriptions for videos

The introduction of audio description (AD) marks a major step toward improving the accessibility of video content. AD provides a spoken narration of important visual elements in a video that are not conveyed by the original audio track. However, producing accurate AD requires substantial resources, such as specialized expertise, equipment, and a significant amount of time, so automating AD production would make video content far more accessible to people with visual impairments. A major challenge in automating AD, though, is generating sentences of the right length to fit the varying temporal gaps between the actors' dialogue.

Recently, large multimodal models (LMMs) have gained popularity in artificial intelligence. They focus on integrating various types of data, including text, images, audio, and video, to become more capable and reliable. For example, GPT-4V is an LMM that adds vision capabilities to the large language model GPT-4. Building on this, a method called MM-VID pioneered the use of GPT-4V for AD generation with a two-stage process: first synthesizing compressed image captions, then refining the final AD output with GPT-4. Unfortunately, these methods lack an explicit character-recognition step.

A team at Microsoft has introduced an automated pipeline that uses GPT-4V(ision) to generate accurate AD for videos. The method takes a movie clip together with its title information and leverages the multimodal capabilities of GPT-4V, integrating visual signals from video frames with textual context to generate AD content. By including AD production guidelines in the prompt that state, in simple natural language, how long the output sentence should be, the method can match the AD length to the available dialogue gap and adapt to different types of videos.
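The word-count guideline can be illustrated with a minimal prompt-construction sketch. The template wording, function name, and the narration rate used to derive the word budget are illustrative assumptions, not the authors' actual prompt:

```python
def build_ad_prompt(movie_title: str, dialogue_gap_seconds: float,
                    words_per_second: float = 2.5) -> str:
    """Build a hypothetical AD-generation prompt that embeds a word budget.

    The word budget is derived from the length of the dialogue gap,
    assuming a typical narration rate (words_per_second is an assumption).
    """
    word_budget = max(1, round(dialogue_gap_seconds * words_per_second))
    return (
        f"You are writing an audio description for the movie '{movie_title}'.\n"
        f"Describe the key visual content of the attached frames in a single "
        f"sentence of about {word_budget} words."
    )

prompt = build_ad_prompt("Example Movie", dialogue_gap_seconds=4.0)
print(prompt)
```

Phrasing the constraint as "about N words" rather than a hard limit mirrors the paper's observation that simple, natural guidelines are enough to steer GPT-4V toward appropriately sized sentences.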

The proposed method is evaluated on the MAD dataset, a large collection of over 264,000 audio descriptions from 488 movies. To generate character tracklets, a simple multi-person tracker captures every character appearing in the input movie clip. TransNetV2 is first used to detect shot boundaries and split clips that contain multiple shots; after tracklet generation, square patches around each person are extracted from the frames. Within these patches, face detection is performed with a YOLOv7 model, and the detected faces are cropped and aligned to a standard size of 112 × 112 pixels.
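The cropping step can be sketched as follows. This is a simplified, NumPy-only illustration, not the paper's implementation: the box format, the boundary clamping, and the nearest-neighbor resize are all assumptions made to keep the example self-contained.

```python
import numpy as np

def crop_square_patch(frame: np.ndarray, box: tuple, size: int = 112) -> np.ndarray:
    """Crop a square patch around a detected box and resize it to size x size.

    frame: H x W x 3 image array; box: (x1, y1, x2, y2) in pixel coordinates.
    Resizing uses nearest-neighbor index mapping to avoid external dependencies.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    half = max(x2 - x1, y2 - y1) // 2
    h, w = frame.shape[:2]
    # Clamp the square crop to the frame boundaries.
    xa, xb = max(0, cx - half), min(w, cx + half)
    ya, yb = max(0, cy - half), min(h, cy + half)
    patch = frame[ya:yb, xa:xb]
    # Nearest-neighbor resize: map each output pixel to a source pixel.
    ys = (np.arange(size) * patch.shape[0] / size).astype(int)
    xs = (np.arange(size) * patch.shape[1] / size).astype(int)
    return patch[ys][:, xs]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_square_patch(frame, (100, 120, 180, 220))
print(patch.shape)  # (112, 112, 3)
```

In practice an alignment step based on facial landmarks would replace the plain resize, but the fixed 112 × 112 output size matches the standard input of common face-embedding models.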

GPT-4V was instructed to generate AD at fixed word counts, e.g., 6, 10, and 20 words, and the resulting performance was compared. In the AudioVault dataset, 80% of ADs contain ten words or fewer, 99% are limited to 20 words, and 6 words is roughly the average word count of the dataset. The results show that the 10-word prompts achieve the highest ROUGE-L and CIDEr scores among the fixed-word prompts of 6, 10, and 20. The proposed method outperforms AutoAD-II, setting a new state-of-the-art performance with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively.
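ROUGE-L, one of the two reported metrics, scores a candidate AD by its longest common subsequence (LCS) with the reference description. A minimal sketch of the standard F-score formulation (the beta default follows common summarization usage; published results come from the official toolkits, not this simplification):

```python
def rouge_l_f1(reference: str, candidate: str, beta: float = 1.2) -> float:
    """Compute the ROUGE-L F-score via longest common subsequence of tokens."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F-score weighted toward recall by beta.
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

score = rouge_l_f1("a man walks into the room", "a man enters the room")
```

Because LCS rewards preserving the reference's word order without requiring contiguous matches, ROUGE-L is well suited to short AD sentences where phrasing varies but key visual terms should appear in order.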

In summary, a team at Microsoft proposed an automated pipeline that leverages GPT-4V(ision) to generate accurate video AD. The method outperforms prior approaches such as AutoAD-II, with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively. However, the proposed method lacks a mechanism for determining the appropriate moments within a movie to insert AD and for estimating the corresponding word count. Future work could therefore improve the quality of the generated AD, for example by adapting a lightweight language rewriting model on available AD data to refine the LLM output.

Visit the Paper. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI, with a focus on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.
