Meta unveils Movie Gen text-to-video and sound AI model


Meta on Friday introduced Movie Gen, an AI-powered text-to-video and sound generator that creates and edits videos from text inputs. Movie Gen also lets users transform photos into videos and generate or extend soundtracks based on prompts.

This launch positions Meta’s tool alongside leading media generation platforms, including OpenAI’s Sora, which has yet to be publicly released.

How Movie Gen Came About

Meta aims to democratize creativity, stating that whether someone is an aspiring filmmaker or a casual content creator, everyone should have “access to tools that help enhance their creativity.”

According to Meta’s research, Movie Gen enables users to produce custom videos and sounds using simple text inputs. In comparison tests, Movie Gen outperformed other models in the industry.

This tool is part of Meta’s ongoing commitment to sharing AI research with the public. Meta’s journey began with the “Make-A-Scene” series, which allowed users to create images, audio, video, and 3D animations. With diffusion models, Meta advanced to the Llama Image foundation models, enabling higher-quality image and video generation.

Movie Gen represents the third phase of this development, merging multiple modalities to provide users with more control than ever before. Meta stresses that while generative AI offers exciting applications, it is not a replacement for artists and animators.

Instead, Movie Gen aims to empower users to express themselves creatively and produce high-definition videos and audio.

Key Features of Movie Gen

Movie Gen offers four main capabilities:

1. Video Generation: Movie Gen uses a 30B parameter transformer model to generate videos up to 16 seconds long at 16 frames per second (256 frames in total). It is trained jointly on text-to-image and text-to-video generation, handling object motion, subject interactions, and camera movements with precision; a sketch of what this workflow looks like with open-source tooling follows this list.

2. Personalized Video Generation: Meta’s tool can take an individual’s image and create a personalized video using text prompts. This feature excels at preserving human identity and motion, according to Meta.

3. Precise Video Editing: Movie Gen allows users to edit videos with high accuracy, supporting both localized edits (e.g., adding or removing elements) and global changes (e.g., swapping backgrounds or styles) while preserving the rest of the original content.

4. Audio Generation: Meta trained a 13B parameter model to generate audio up to 45 seconds long, including sound effects, background music, and ambient sounds. All audio is synced with the video, and an audio extension feature enables coherent sound generation for longer videos; an audio sketch also follows below.
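Movie Gen itself has no public API, so the snippet below is purely illustrative. It is a minimal sketch of a prompt-to-video workflow using Hugging Face’s open-source diffusers library with the publicly available ModelScope text-to-video model, a far smaller model than Movie Gen’s 30B transformer; the model choice, prompt, and settings are assumptions for illustration, not details from Meta.

```python
# Illustrative stand-in: Movie Gen is not publicly available, so this uses
# an open-source text-to-video diffusion model instead.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load a small open text-to-video pipeline (~1.7B parameters, not Movie Gen).
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps GPU memory usage manageable

prompt = "A golden retriever surfing a wave at sunset"
# num_frames sets the clip length; recent diffusers versions return a batch,
# hence the [0] to pull out the first (and only) generated clip.
frames = pipe(prompt, num_inference_steps=25, num_frames=16).frames[0]

export_to_video(frames, "clip.mp4")
```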
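Movie Gen’s 13B audio model is likewise unreleased. As a rough stand-in, the sketch below uses the open-source AudioLDM 2 pipeline shipped with diffusers to generate ambient sound from a text prompt; the model, prompt, and duration are illustrative assumptions only.

```python
# Illustrative stand-in: prompt-driven sound generation with AudioLDM 2,
# an open model, since Meta's audio model is not publicly available.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
).to("cuda")

prompt = "Gentle ocean waves with distant seagulls"
# audio_length_in_s sets clip duration; Movie Gen targets up to 45 seconds.
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM 2 produces 16 kHz mono audio as a NumPy array.
scipy.io.wavfile.write("ambience.wav", rate=16000, data=audio)
```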

Results and Innovations

Meta’s foundation models have driven technical innovations in architecture, training methods, and evaluation protocols. Human evaluators consistently preferred Movie Gen over industry alternatives across the four main capabilities. Meta has shared a detailed 92-page research paper outlining the technical insights of Movie Gen.

Despite its promising potential, Meta acknowledges that Movie Gen has some limitations, including long generation times and the need for further optimization. They are actively working on improving these aspects as development continues.

Looking Ahead

Meta plans to collaborate with filmmakers and creators to refine Movie Gen based on user feedback. The company envisions a future where users can create personalized videos, share content on platforms like Reels, or generate custom animations for apps like WhatsApp.

Meta’s chief product officer, Chris Cox, shared on Threads that Movie Gen is not ready for public release due to high costs and slow generation times, though the results are promising. Meanwhile, CEO Mark Zuckerberg announced that Movie Gen will come to Instagram next year, showcasing a video created with the tool.