Meta AI Unleashes PEAV: A Multimodal Breakthrough for Audio, Video, and Text Understanding
Wednesday, December 24, 2025 · 3 min read

Meta AI researchers have introduced Perception Encoder Audiovisual (PEAV), a family of open-source encoders built for joint audio and video understanding. The model aligns audio, video, and text representations within a single embedding space, learned through contrastive training on roughly 100 million audio-video pairs accompanied by descriptive text.

Architectural Foundations and Multimodal Fusion

PEAV builds on Meta’s Perception Encoder (PE) framework, a vision stack known for state-of-the-art performance across image, video, and audio benchmarks. The architecture combines frame, video, audio, and text encoders, with an audio-video fusion encoder at its core that learns a shared representation for both streams. The audio pathway uses a DAC VAE to convert raw waveforms into discrete tokens, while the video pathway reuses the existing PE frame encoder. The result is a versatile backbone that handles diverse cross-modal queries, from retrieving video given text to finding audio given video, without task-specific retraining.
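Why does a single shared embedding space enable retraining-free cross-modal queries? Because retrieval reduces to nearest-neighbor search over embeddings, regardless of which modality produced them. The sketch below illustrates the idea with NumPy; the function name and shapes are illustrative, not PEAV's actual API:

```python
import numpy as np

def retrieve(query_emb, candidate_embs, top_k=3):
    """Rank candidates by cosine similarity to a query embedding.

    Because all modalities live in one space, `query_emb` could come from
    a text encoder and `candidate_embs` from a video encoder (or any other
    combination) without changing this code. Purely a sketch, not PEAV.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]  # best-first indices
    return [int(i) for i in order], scores[order].tolist()
```

Swapping text-to-video retrieval for audio-to-video retrieval is just a matter of which encoder produced the query vector; the search itself never changes.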

Innovative Data Engine and Advanced Training

To overcome the scarcity of labeled audiovisual data, Meta AI built a two-stage data engine that generates high-quality synthetic captions for vast numbers of unlabeled clips. The first stage combines weak caption models with a large language model to produce audio, visual, and joint audiovisual captions; in the second stage, an initial PEAV model works with a Perception Language Model decoder to refine these captions, ultimately yielding reliable supervision for 100 million diverse audio-video pairs. Training uses a generalized sigmoid-based contrastive loss that aligns up to ten different modality pairs (audio, video, text, and fused representations) in a unified space, supporting robust classification, retrieval, and correspondence tasks.
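A sigmoid-based contrastive loss treats every (query, candidate) pair in a batch independently: matched pairs should score high, mismatched pairs low. As a minimal sketch of this style of loss for one modality pair — the temperature and bias values here are illustrative placeholders, not Meta's settings:

```python
import numpy as np

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid contrastive loss over two embedding batches.

    Row i of `emb_a` and row i of `emb_b` are a positive pair; every
    other (i, j) combination is a negative. A sketch only; PEAV
    generalizes this idea across up to ten modality pairs.
    """
    # L2-normalize so dot products are cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = temperature * (a @ b.T) + bias   # (N, N) pair logits
    labels = 2 * np.eye(len(a)) - 1           # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit), averaged over all N*N pairs
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Unlike a softmax contrastive loss, each pair contributes independently, so the loss extends naturally to many modality pairings in one unified objective.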

Setting New Performance Benchmarks

PEAV demonstrates exceptional zero-shot retrieval and classification performance across numerous audio and video domains, surpassing previous state-of-the-art models. Key advancements include significant improvements in:

  • AudioCaps text-to-audio retrieval, increasing from 35.4 R@1 to 45.8 R@1.
  • VGGSound clip-level classification accuracy, rising from 36.0 to 47.1.
  • VCTK-style speech retrieval, reaching an impressive 85.6 accuracy.
  • ActivityNet text-to-video retrieval, improving from 60.4 R@1 to 66.5 R@1.
  • Kinetics 400 zero-shot video classification, rising from 76.9 to 78.9, outperforming larger models.

PEA-Frame and Integration with SAM Audio

Alongside PEAV, Meta has also introduced Perception Encoder Audio Frame (PEA-Frame), an audio-text embedding model focused on precise sound event localization. PEA-Frame uses frame-level contrastive learning to align audio frames with text, enabling the identification of temporal spans for specific sounds within long audio sequences. Both PEAV and PEA-Frame are integral to Meta’s broader Perception Models stack and serve as the core perception engine behind the company’s new SAM Audio model, enabling prompt-based audio separation and quality assessment in complex sound environments.
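Once audio frames and text share an embedding space, localizing a sound event amounts to scoring each frame against a text query and collecting the contiguous spans that score above a threshold. A hypothetical sketch of that final step — the function name, threshold, and span format are assumptions, not PEA-Frame's interface:

```python
import numpy as np

def locate_sound(frame_embs, query_emb, threshold=0.5):
    """Return [start, end) frame-index spans matching a text query.

    `frame_embs` is an (n_frames, d) array of audio frame embeddings and
    `query_emb` a (d,) text embedding in the same space. Illustrative only.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    active = (f @ q) > threshold          # per-frame match decision
    spans, start = [], None
    for i, on in enumerate(active):       # merge consecutive frames into spans
        if on and start is None:
            start = i
        elif not on and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(active)))
    return spans
```

Frame indices convert to timestamps by multiplying by the frame hop duration, which is how a model like this can report when a specific sound occurs in a long recording.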

Key Innovations

  • Unified Multimodal Encoder: PEAV integrates audio, video, and text into a single embedding space, facilitating comprehensive cross-modal understanding.
  • Large-Scale Synthetic Data: A two-stage data engine generates over 100 million high-quality synthetic audiovisual captions for robust training.
  • State-of-the-Art Performance: PEAV achieves leading results across various audio and video benchmarks, outperforming existing models in zero-shot tasks.
  • Core for SAM Audio: PEAV and its variant PEA-Frame provide essential perception capabilities for Meta’s advanced SAM Audio system, enabling precise sound localization and separation.

This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.

Source: MarkTechPost