Meta AI researchers have introduced Perception Encoder Audiovisual (PEAV), a family of open-source encoders built for joint audio and video understanding. The model aligns audio, video, and text in a single embedding space, learned through large-scale contrastive training on roughly 100 million audio-video pairs accompanied by descriptive text.
Architectural Foundations and Multimodal Fusion
PEAV builds on Meta’s Perception Encoder (PE) framework, a vision stack with state-of-the-art performance across image, video, and audio benchmarks. The architecture combines separate frame, video, audio, and text encoders, with an audio-video fusion encoder at its center learning a shared representation for both streams. The audio pathway employs DAC VAE to convert raw waveforms into discrete tokens, while the video path reuses the existing PE frame encoder. The result is a single backbone that can serve diverse cross-modal queries, from retrieving video from text to finding audio from video, without task-specific retraining.
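To make the shared-space idea concrete, here is a minimal sketch of how separately encoded audio, video, and text can be compared in one embedding space so that any modality pair is matched with the same similarity computation. The encoder stubs, dimensions, and batch sizes are illustrative assumptions, not the actual PEAV implementation.

```python
# Minimal sketch of cross-modal retrieval in a shared embedding space.
# StubEncoder and all dimensions are hypothetical stand-ins, not PEAV's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed shared embedding width

class StubEncoder(nn.Module):
    """Stands in for a modality-specific encoder (audio / video / text)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, EMBED_DIM)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project into the shared space and L2-normalize so cosine
        # similarity reduces to a dot product.
        return F.normalize(self.proj(x), dim=-1)

audio_enc, video_enc, text_enc = StubEncoder(128), StubEncoder(256), StubEncoder(64)

# Toy batch: 4 clips with audio/video features and 4 text queries.
audio_emb = audio_enc(torch.randn(4, 128))
video_emb = video_enc(torch.randn(4, 256))
text_emb = text_enc(torch.randn(4, 64))

# Any pair of modalities can be compared with the same similarity matrix,
# so text->video, text->audio, or video->audio retrieval needs no retraining.
sim_text_to_video = text_emb @ video_emb.T   # (4 queries, 4 clips)
best_clip_per_query = sim_text_to_video.argmax(dim=1)
print(best_clip_per_query)
```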
Innovative Data Engine and Advanced Training
To overcome the shortage of labeled audio-video data, Meta AI built a two-stage data engine that generates high-quality synthetic captions for large volumes of unlabeled clips. The first stage combines weaker caption models with a large language model to produce audio, visual, and joint audiovisual captions; in the second stage, an initial PEAV model works with a Perception Language Model decoder to refine those captions, yielding reliable supervision for 100 million diverse audio-video pairs. Training uses a generalized sigmoid-based contrastive loss that aligns up to ten modality pairs (audio, video, text, and fused representations) in a unified space, supporting classification, retrieval, and correspondence tasks.
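The sketch below shows a SigLIP-style sigmoid contrastive loss for a single modality pair; a generalized multi-pair version, as described for PEAV, would simply sum this term over every modality pair being aligned. The temperature and bias values, and the function itself, are illustrative assumptions rather than PEAV's actual training code.

```python
# Sketch of a sigmoid (SigLIP-style) contrastive loss for one modality pair.
# In a multi-pair setup the loss would be summed over all modality pairs.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    """emb_a, emb_b: (N, D) L2-normalized embeddings; row i of each is a matched pair."""
    logits = emb_a @ emb_b.T * temperature + bias   # (N, N) similarity logits
    labels = 2 * torch.eye(emb_a.size(0)) - 1       # +1 on the diagonal, -1 off it
    # Each (i, j) cell is an independent binary match / non-match decision.
    return -F.logsigmoid(labels * logits).mean()

# Toy usage: align a batch of audio and video embeddings.
a = F.normalize(torch.randn(8, 512), dim=-1)
v = F.normalize(torch.randn(8, 512), dim=-1)
loss = sigmoid_contrastive_loss(a, v)
# total_loss = sum(sigmoid_contrastive_loss(x, y) for x, y in modality_pairs)
print(loss.item())
```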
Setting New Performance Benchmarks
PEAV delivers strong zero-shot retrieval and classification results across a wide range of audio and video domains, surpassing previous state-of-the-art models. Key improvements include:
- AudioCaps text-to-audio retrieval, increasing from 35.4 R@1 to 45.8 R@1.
- VGGSound clip-level classification accuracy, rising from 36.0 to 47.1.
- VCTK-style speech retrieval, reaching 85.6 accuracy.
- ActivityNet text-to-video retrieval, improving from 60.4 R@1 to 66.5 R@1.
- Kinetics 400 zero-shot video classification, improving from 76.9 to 78.9 and outperforming larger models.
PEA-Frame and Integration with SAM Audio
Alongside PEAV, Meta has introduced Perception Encoder Audio Frame (PEA-Frame), an audio-text embedding model focused on precise sound event localization. PEA-Frame uses frame-level contrastive learning to align audio frames with text, so it can identify the temporal span of a specific sound within long audio sequences. Both PEAV and PEA-Frame are part of Meta’s broader Perception Models stack and serve as the core perception engine behind the company’s new SAM Audio model, enabling prompt-based audio separation and quality assessment in complex sound environments.
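As an illustration of frame-level localization, the following sketch assumes a hypothetical PEA-Frame-style setup: the frame rate, threshold, and embedding shapes are made up, and it only shows how per-frame audio embeddings scored against a text query could be turned into temporal spans.

```python
# Sketch of frame-level sound localization with an audio-text embedding model.
# Frame rate, threshold, and embedding shapes are illustrative assumptions,
# not PEA-Frame's actual interface.
import torch
import torch.nn.functional as F

FRAME_HZ = 10          # assumed frame embeddings per second
THRESHOLD = 0.3        # assumed similarity cutoff for "event present"

# Pretend per-frame audio embeddings for a 30 s clip and one text query embedding.
frame_emb = F.normalize(torch.randn(300, 512), dim=-1)   # (frames, dim)
query_emb = F.normalize(torch.randn(512), dim=-1)        # e.g. "dog barking"

scores = frame_emb @ query_emb                            # per-frame similarity
active = scores > THRESHOLD                               # boolean event mask

# Merge consecutive active frames into (start_sec, end_sec) spans.
spans, start = [], None
for i, on in enumerate(active.tolist() + [False]):        # sentinel closes the last span
    if on and start is None:
        start = i
    elif not on and start is not None:
        spans.append((start / FRAME_HZ, i / FRAME_HZ))
        start = None
print(spans)
```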
Key Innovations
- Unified Multimodal Encoder: PEAV integrates audio, video, and text into a single embedding space, facilitating comprehensive cross-modal understanding.
- Large-Scale Synthetic Data: A two-stage data engine generates over 100 million high-quality synthetic audiovisual captions for robust training.
- State-of-the-Art Performance: PEAV achieves leading results across various audio and video benchmarks, outperforming existing models in zero-shot tasks.
- Core for SAM Audio: PEAV and its variant PEA-Frame provide essential perception capabilities for Meta’s advanced SAM Audio system, enabling precise sound localization and separation.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost