Pioneering Language-Driven Motion Prediction
Salesforce AI’s latest research introduces FOFPred, a framework that forecasts dense motion from natural language commands. The system couples a vision-language model with a diffusion transformer to predict how pixels will move, supporting both robot manipulation control and motion-controlled video generation.
FOFPred processes one or more initial images alongside a descriptive text input, such as “moving the bottle from right to left.” Based on these inputs, it generates four subsequent optical flow frames, delineating the anticipated movement of each pixel over time.
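For intuition, the minimal sketch below shows plausible input and output shapes for such a call. The names, shapes, and placeholder output are illustrative assumptions, since the article does not describe a released API.

```python
import torch

# Illustrative I/O shapes only; FOFPred's actual interface is not public in the article.
batch, height, width, horizon = 1, 480, 640, 4

image = torch.rand(batch, 3, height, width)        # one initial RGB frame
caption = "moving the bottle from right to left"   # language condition

# A hypothetical forward call would map (image, caption) to four future flow frames,
# each holding a per-pixel (dx, dy) displacement.
future_flow = torch.zeros(batch, horizon, 2, height, width)  # placeholder output
print(future_flow.shape)  # torch.Size([1, 4, 2, 480, 640])
```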
The Essence of Future Optical Flow
Optical flow quantifies the apparent per-pixel displacement between frames. FOFPred focuses on predicting future optical flow, meaning it forecasts detailed displacement fields for upcoming frames solely from current observations and text, without needing future visual data during inference.
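To make the definition concrete, the short example below computes ordinary optical flow between two existing frames with OpenCV's Farneback method. FOFPred's task is harder: it must produce such displacement fields for frames that have not yet been observed. The random frames here are stand-ins for real video frames.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (random stand-ins; real frames would come from a video).
prev_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
next_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

# Dense Farneback optical flow: for every pixel, a (dx, dy) displacement to the next frame.
# Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (480, 640, 2): one displacement vector per pixel
```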
This “motion-only” representation offers a compact and efficient way to describe movement. By isolating pixel-level motion and disregarding static visual elements, it serves as an ideal intermediate state for robot control strategies and as a guiding signal for advanced video diffusion models. This approach also simplifies the output distribution compared to predicting entire future RGB frames, avoiding the complexities of modeling textures and fine visual details irrelevant to motion planning.
Unified Architectural Design
FOFPred employs a cohesive architecture comprising a frozen vision-language model, a static variational autoencoder (VAE), and a trainable diffusion transformer. The processing pipeline integrates:
- Qwen2.5-VL as the vision-language encoder for jointly processing text captions and visual inputs.
- Flux.1 VAE, responsible for encoding input images and target optical flow sequences into latent representations.
- An OmniGen-inspired diffusion transformer (DiT), which accepts projected visual and textual features to produce latent future flow sequences.
Crucially, only the DiT and small multi-layer perceptron (MLP) projectors are trained. The weights of Qwen2.5-VL and Flux.1 remain frozen, letting the system leverage existing image-editing pretraining and multimodal reasoning capabilities. Temporal modeling is handled by extending the RoPE positional encoding and attention blocks to full spatio-temporal positions, which adds no extra parameters and lets the model reuse OmniGen's image pretraining directly.
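A minimal sketch of this trainable/frozen split might look like the following, with placeholder modules and dimensions standing in for Qwen2.5-VL, the Flux.1 VAE, and the OmniGen-style DiT.

```python
import torch.nn as nn

# Sketch of the trainable/frozen split described above. The real components
# (Qwen2.5-VL, Flux.1 VAE, OmniGen-style DiT) and dimensions are stand-ins here.
class FOFPredSketch(nn.Module):
    def __init__(self, vlm: nn.Module, vae: nn.Module, dit: nn.Module,
                 vlm_dim=4096, dit_dim=2048):
        super().__init__()
        self.vlm, self.vae, self.dit = vlm, vae, dit
        # Small MLP projector mapping VLM features into the DiT conditioning space.
        self.projector = nn.Sequential(
            nn.Linear(vlm_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, dit_dim)
        )
        # Freeze the vision-language model and the VAE; only DiT + projector get gradients.
        for module in (self.vlm, self.vae):
            for p in module.parameters():
                p.requires_grad_(False)

    def trainable_parameters(self):
        return [p for p in self.parameters() if p.requires_grad]
```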
Robust Training with Relative Optical Flow
The FOFPred model is trained on extensive web-scale human activity videos paired with descriptive captions, utilizing approximately 500,000 video-caption pairs from datasets like Something Something V2 and EgoDex.
Training employs an end-to-end flow matching objective within the latent space. Future optical flow sequences are pre-computed offline, encoded by the VAE, and used as targets for a diffusion loss. The methodology also incorporates classifier-free guidance for both text and visual conditions, masking specific frames or viewpoints to bolster robustness.
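The sketch below illustrates what one such training step could look like, assuming a linear-interpolation flow matching path and random condition dropout for classifier-free guidance; the module interface and dropout scheme are placeholders rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(dit, flow_latents, text_cond, visual_cond, p_drop=0.1):
    """One hedged training-step sketch: flow matching on VAE-encoded future-flow latents.

    `dit`, `text_cond`, and `visual_cond` are placeholders for the real modules/features.
    """
    noise = torch.randn_like(flow_latents)                        # x_0 ~ N(0, I)
    t = torch.rand(flow_latents.shape[0], device=flow_latents.device)
    t_ = t.view(-1, *([1] * (flow_latents.dim() - 1)))
    x_t = (1.0 - t_) * noise + t_ * flow_latents                  # linear interpolation path
    velocity_target = flow_latents - noise                        # constant-velocity target

    # Classifier-free guidance training: randomly drop the text / visual conditions.
    if torch.rand(()) < p_drop:
        text_cond = torch.zeros_like(text_cond)
    if torch.rand(()) < p_drop:
        visual_cond = torch.zeros_like(visual_cond)

    pred = dit(x_t, t, text_cond, visual_cond)                    # predicted velocity field
    return F.mse_loss(pred, velocity_target)
```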
A critical innovation is the relative optical flow computation, which produces clean training targets from inherently noisy egocentric videos. For each frame pair, the system:
- Determines dense optical flow.
- Estimates camera motion via homography from deep features.
- Subtracts camera motion using projective geometry to obtain object-centric relative flow vectors.
- Filters frame pairs, selecting those with significant motion to focus training.
These operations run offline at reduced resolution for efficiency, and final targets are recomputed at the original resolution. Ablation studies confirmed that static-frame targets or raw flow without camera-motion compensation degrade performance, while the disentangled relative flow targets deliver the best results.
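The sketch below walks through these four steps with classical OpenCV tools. Note that the paper estimates the camera homography from deep features, for which ORB keypoints serve only as a stand-in, and the motion threshold is an arbitrary illustrative value.

```python
import cv2
import numpy as np

def relative_flow(prev_gray, next_gray, min_motion=1.0):
    """Hedged sketch of the relative-flow pipeline: dense flow, camera homography,
    camera-motion subtraction, and a significant-motion filter."""
    h, w = prev_gray.shape
    # 1. Dense optical flow between the two frames.
    raw_flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    # 2. Camera motion as a homography from matched keypoints (deep features in the paper).
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # 3. Subtract the camera-induced displacement to obtain object-centric flow.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(grid, H).reshape(h, w, 2)
    camera_flow = warped - np.stack([xs, ys], axis=-1)
    rel_flow = raw_flow - camera_flow
    # 4. Keep only frame pairs with significant residual (object) motion.
    keep = np.linalg.norm(rel_flow, axis=-1).mean() > min_motion
    return rel_flow, keep
```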
Transformative Applications
Language-Driven Robot Manipulation
FOFPred significantly enhances robot control. Fine-tuned with robot video and caption data, it predicts future optical flow from both fixed and wrist-mounted cameras. A diffusion policy network then builds upon these predictions, integrating flow, text instructions, and robot state to generate continuous actions. This refines existing diffusion policy approaches by utilizing future optical flow as its central representation.
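Conceptually, that policy head could be sketched as below, where predicted flow features, a text embedding, and the robot state condition a diffusion-style action denoiser. All dimensions and layer choices here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    """Hedged sketch: fuse flow, text, and state into a conditioning vector,
    then denoise a short action chunk in a diffusion-policy style."""
    def __init__(self, flow_dim=512, text_dim=512, state_dim=14, action_dim=14, horizon=16):
        super().__init__()
        self.cond = nn.Linear(flow_dim + text_dim + state_dim, 512)
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * action_dim + 512 + 1, 1024), nn.SiLU(),
            nn.Linear(1024, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, timestep, flow_feat, text_feat, state):
        cond = self.cond(torch.cat([flow_feat, text_feat, state], dim=-1))
        x = torch.cat([noisy_actions.flatten(1), cond, timestep[:, None]], dim=-1)
        return self.denoiser(x).view(-1, self.horizon, self.action_dim)
```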
On the CALVIN ABCD benchmark, FOFPred achieved an average chain length of 4.48, surpassing VPP (4.33) and DreamVLA (4.44), and recorded a 78.7 percent success rate on Task 5, placing it among the strongest reported methods. In a data-scarce setting (10 percent of CALVIN demonstrations), FOFPred still reached a 3.43 average length, outperforming VPP's 3.25.
For RoboTwin 2.0, a dual-arm manipulation benchmark, FOFPred demonstrated an average success rate of 68.6 percent, improving upon the VPP baseline’s 61.8 percent and showing better performance across all tasks.
Motion-Aware Text-to-Video Generation
FOFPred also excels in controlling motion during text-to-video generation. Researchers developed a two-stage pipeline by integrating FOFPred with the “Go with the Flow” video diffusion model. FOFPred takes an initial frame and a linguistic motion description, predicting a sequence of future flow frames which are then interpolated into a dense motion field. “Go with the Flow” subsequently uses this motion field and initial frame to synthesize the video, precisely adhering to the described motion pattern.
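A hedged sketch of this two-stage handoff is shown below, with `fofpred` and `go_with_the_flow` as stand-ins for the actual models and simple trilinear interpolation standing in for whatever densification the authors use.

```python
import torch
import torch.nn.functional as F

def generate_video(fofpred, go_with_the_flow, first_frame, caption, video_frames=49):
    """Two-stage sketch: predict sparse future flow, densify it in time, then condition
    the video model on the dense motion field plus the first frame."""
    flow = fofpred(first_frame, caption)                 # assumed shape: (1, 4, 2, H, W)
    # Temporally interpolate from 4 predicted flow frames to the full video length.
    dense_flow = F.interpolate(
        flow.permute(0, 2, 1, 3, 4),                     # (1, 2, 4, H, W) for 3D interp
        size=(video_frames, flow.shape[-2], flow.shape[-1]),
        mode="trilinear", align_corners=True,
    ).permute(0, 2, 1, 3, 4)                             # back to (1, T, 2, H, W)
    return go_with_the_flow(first_frame, dense_flow, caption)
```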
On the motion-intensive Something Something V2 benchmark, the FOFPred plus “Go with the Flow” pipeline showed marked improvements over the CogVideoX baseline, reporting SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity of 0.662. Notably, FOFPred requires only language and a single frame at inference, unlike several comparable baselines that demand additional inputs such as masks or trajectories.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost