Salesforce AI's FOFPred: Revolutionizing Robotics and Video Generation with Language-Guided Motion
Thursday, January 22, 2026 · 5 min read


Pioneering Language-Driven Motion Prediction

Salesforce AI’s latest research introduces FOFPred, an advanced framework designed to forecast dense motion using natural language commands. This novel system integrates sophisticated vision-language models with diffusion transformers, enabling precise prediction of pixel movements for improved robot operational control and compelling video generation.

FOFPred processes one or more initial images alongside a descriptive text input, such as “moving the bottle from right to left.” Based on these inputs, it generates four subsequent optical flow frames, delineating the anticipated movement of each pixel over time.

The Essence of Future Optical Flow

Optical flow quantifies the apparent per-pixel displacement between frames. FOFPred focuses on predicting future optical flow, meaning it forecasts detailed displacement fields for upcoming frames solely from current observations and text, without needing future visual data during inference.

This “motion-only” representation offers a compact and efficient way to describe movement. By isolating pixel-level motion and disregarding static visual elements, it serves as an ideal intermediate state for robot control strategies and as a guiding signal for advanced video diffusion models. This approach also simplifies the output distribution compared to predicting entire future RGB frames, avoiding the complexities of modeling textures and fine visual details irrelevant to motion planning.
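To make the notion of per-pixel displacement concrete, here is a minimal, self-contained sketch of classical optical flow via exhaustive block matching on a synthetic pair of frames. This is a toy stand-in for the learned flow FOFPred predicts, not the paper's method; the frames and parameters are invented for illustration.

```python
import numpy as np

def block_flow(prev, nxt, block=8, search=6):
    """Estimate per-block displacement (dx, dy) by exhaustive block matching.

    For each block in `prev`, search nearby positions in `nxt` and keep
    the shift with the smallest absolute difference.
    """
    H, W = prev.shape
    flow = np.zeros((H // block, W // block, 2))
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            patch = prev[y:y + block, x:x + block]
            best, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue
                    err = np.abs(nxt[yy:yy + block, xx:xx + block] - patch).sum()
                    if best is None or err < best:
                        best, best_d = err, (dx, dy)
            flow[by, bx] = best_d
    return flow

prev = np.zeros((32, 32)); prev[8:16, 8:16] = 1.0  # a bright square
nxt = np.roll(prev, 4, axis=1)                     # square shifted 4 px right
flow = block_flow(prev, nxt)
print(flow[1, 1])  # block covering the square: displacement (4, 0)
```

The recovered displacement for the block containing the square is (4, 0), matching the synthetic rightward shift; everything else is static, which is exactly the kind of sparse, motion-only signal the text describes.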

Unified Architectural Design

FOFPred employs a cohesive architecture comprising a frozen vision-language model, a frozen variational autoencoder (VAE), and a trainable diffusion transformer. The processing pipeline integrates:

  • Qwen2.5-VL as the vision-language encoder for jointly processing text captions and visual inputs.
  • Flux.1 VAE, responsible for encoding input images and target optical flow sequences into latent representations.
  • An OmniGen-inspired diffusion transformer (DiT), which accepts projected visual and textual features to produce latent future flow sequences.

Crucially, only the DiT and minor multi-layer perceptron (MLP) projectors undergo training. The weights for Qwen2.5-VL and Flux.1 remain frozen, allowing the system to leverage existing image editing pretraining and multimodal reasoning capacities. Temporal modeling is seamlessly integrated by extending RoPE positional encoding and attention blocks to encompass full spatio-temporal positions, adding no extra parameters while directly utilizing OmniGen image pretraining.
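The freeze/train split described above can be sketched in PyTorch as follows. All module classes here are tiny hypothetical placeholders, not the actual Qwen2.5-VL, Flux.1, or OmniGen implementations; the point is only the gradient plumbing.

```python
import torch
import torch.nn as nn

vlm = nn.Linear(512, 256)        # stand-in for the frozen Qwen2.5-VL encoder
vae = nn.Linear(3, 16)           # stand-in for the frozen Flux.1 VAE
projector = nn.Linear(256, 128)  # trainable MLP projector
dit = nn.TransformerEncoder(     # stand-in for the OmniGen-style DiT
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2)

# Freeze the pretrained components; only the DiT and projectors get gradients.
for module in (vlm, vae):
    for p in module.parameters():
        p.requires_grad = False

trainable = [p for m in (projector, dit) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable))  # only DiT/projector params update
```

Because the optimizer only sees the DiT and projector parameters, the pretrained encoders contribute features without ever being updated, which is the property the architecture relies on.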

Robust Training with Relative Optical Flow

The FOFPred model is trained on extensive web-scale human activity videos paired with descriptive captions, utilizing approximately 500,000 video-caption pairs from datasets like Something Something V2 and EgoDex.

Training employs an end-to-end flow matching objective within the latent space. Future optical flow sequences are pre-computed offline, encoded by the VAE, and used as targets for a diffusion loss. The methodology also incorporates classifier-free guidance for both text and visual conditions, masking specific frames or viewpoints to bolster robustness.
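A minimal latent flow-matching objective looks like the following sketch. The "model" is a toy MLP standing in for the DiT, and `latents` stand in for VAE-encoded future-flow sequences; the interpolation path and velocity target follow the standard flow-matching formulation, which may differ in detail from the paper's exact setup.

```python
import torch
import torch.nn as nn

# Toy velocity-prediction network: input is [noisy latent, time t].
model = nn.Sequential(nn.Linear(17, 64), nn.SiLU(), nn.Linear(64, 16))

def flow_matching_loss(latents):
    x1 = latents                    # target latent (encoded future flow)
    x0 = torch.randn_like(x1)       # noise sample
    t = torch.rand(x1.shape[0], 1)  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1      # point on the linear interpolation path
    v_target = x1 - x0              # ground-truth velocity along the path
    v_pred = model(torch.cat([xt, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 16))
loss.backward()  # gradients flow only through the trainable model
print(float(loss))
```

Classifier-free guidance would be added on top of this by randomly dropping the text or visual conditioning during training, which the sketch omits for brevity.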

A critical innovation is the relative optical flow calculation, which generates pristine training targets from inherently noisy egocentric videos. For each frame pair, the system:

  • Determines dense optical flow.
  • Estimates camera motion via homography from deep features.
  • Subtracts camera motion using projective geometry to obtain object-centric relative flow vectors.
  • Filters frame pairs, selecting those with significant motion to focus training.

These operations are initially executed offline at lower resolution for efficiency, then recomputed at the original resolution for final targets. Ablation studies confirmed that static frame targets, or raw flow without camera-motion compensation, degrade performance, while disentangled relative flow targets deliver superior outcomes.
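The camera-compensation step in the pipeline above can be sketched with a known homography. Here the homography is a hypothetical pure translation (not estimated from deep features as in the paper), but the subtraction of camera-induced flow from total flow is the same idea.

```python
import numpy as np

H_cam = np.array([[1.0, 0.0, 2.0],   # example camera motion: a pure
                  [0.0, 1.0, 0.0],   # 2-pixel horizontal translation
                  [0.0, 0.0, 1.0]])

h, w = 16, 16
ys, xs = np.mgrid[0:h, 0:w]
pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N

# Camera-induced flow: where the homography maps each pixel, minus the pixel.
warped = H_cam @ pts
warped = warped[:2] / warped[2]
cam_flow = (warped - pts[:2]).T.reshape(h, w, 2)

total_flow = np.zeros((h, w, 2))
total_flow[..., 0] = 2.0             # camera translation everywhere...
total_flow[4:8, 4:8, 0] = 5.0        # ...plus an object moving faster

relative_flow = total_flow - cam_flow  # object-centric relative flow
print(relative_flow[0, 0])           # (0, 0): static background
print(relative_flow[5, 5])           # (3, 0): the object's own motion
```

After subtraction, the background flow vanishes and only the object's 3-pixel residual motion remains, which is the clean, object-centric target the training pipeline wants.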

Transformative Applications

Language-Driven Robot Manipulation

FOFPred significantly enhances robot control. Fine-tuned with robot video and caption data, it predicts future optical flow from both fixed and wrist-mounted cameras. A diffusion policy network then builds upon these predictions, integrating flow, text instructions, and robot state to generate continuous actions. This refines existing diffusion policy approaches by utilizing future optical flow as its central representation.

On the CALVIN ABCD benchmark, FOFPred achieved an average chain length of 4.48, surpassing VPP (4.33) and DreamVLA (4.44). It also recorded a 78.7 percent success rate for Task 5, making it a leading method. In data-scarce scenarios (10 percent of CALVIN demonstrations), FOFPred still achieved a 3.43 average length, outperforming VPP's 3.25.

For RoboTwin 2.0, a dual-arm manipulation benchmark, FOFPred demonstrated an average success rate of 68.6 percent, improving upon the VPP baseline’s 61.8 percent and showing better performance across all tasks.

Motion-Aware Text-to-Video Generation

FOFPred also excels in controlling motion during text-to-video generation. Researchers developed a two-stage pipeline by integrating FOFPred with the “Go with the Flow” video diffusion model. FOFPred takes an initial frame and a linguistic motion description, predicting a sequence of future flow frames which are then interpolated into a dense motion field. “Go with the Flow” subsequently uses this motion field and initial frame to synthesize the video, precisely adhering to the described motion pattern.
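The interpolation step, in which a few predicted flow frames become a denser motion field, could be as simple as linear interpolation in time. The sketch below uses plain per-pixel linear interpolation on invented constant flow frames; the actual pipeline's interpolation scheme is not specified here and may differ.

```python
import numpy as np

h, w = 8, 8
# Four predicted key flow frames (invented constant fields for illustration).
key_flows = np.stack([np.full((h, w, 2), float(i)) for i in range(4)])

def densify(key_flows, n_out):
    """Linearly interpolate key flow frames to n_out frames in time."""
    n_key = len(key_flows)
    t_out = np.linspace(0, n_key - 1, n_out)
    dense = np.empty((n_out, *key_flows.shape[1:]))
    for i, t in enumerate(t_out):
        lo = int(np.floor(t))
        hi = min(lo + 1, n_key - 1)
        a = t - lo
        dense[i] = (1 - a) * key_flows[lo] + a * key_flows[hi]
    return dense

dense = densify(key_flows, 13)
print(dense.shape)     # (13, 8, 8, 2)
print(dense[6, 0, 0])  # frame at t = 1.5: flow value [1.5, 1.5]
```

The four key frames become thirteen evenly spaced frames, giving the downstream video diffusion model a per-frame motion field to condition on.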

On the motion-intensive Something Something V2 benchmark, the FOFPred and “Go with the Flow” pipeline showed marked improvements over the CogVideoX baseline. It achieved superior metrics including SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662. Notably, FOFPred only requires language and a single frame during inference, unlike several comparable baselines that demand additional inputs like masks or trajectories.

This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.

Source: MarkTechPost