Friday, January 30, 2026 · 4 min read

Revolutionizing Robotics: Ant Group's LingBot-VLA Unleashes Universal Control for Real-World Manipulation

Ant Group's Robbyant division has unveiled LingBot-VLA, a Vision-Language-Action (VLA) foundation model positioned to transform real-world robot manipulation. The model is engineered so that a single artificial intelligence system can control many distinct dual-arm robots, addressing a significant challenge in contemporary robotics.

Extensive Training and Diverse Data Sources

The development of LingBot-VLA involved an intensive training regimen, utilizing approximately 20,000 hours of bimanual teleoperated data. This vast dataset was gathered from nine different dual-arm robot configurations, ensuring broad applicability. These included popular systems such as AgiBot G1, AgileX, Galaxea R1Lite, Galaxea R1Pro, Realman Rs 02, Leju KUAVO 4 Pro, Qinglong humanoid, ARX Lift2, and a Bimanual Franka setup. Each robot system featured dual arms with six or seven degrees of freedom, parallel grippers, and multiple RGB-D cameras for comprehensive environmental perception.

The data collection process employed various teleoperation methods, including VR control for AgiBot G1 and isomorphic arm control for AgileX. Recorded videos were meticulously segmented by human annotators into clips representing atomic actions. Task-level and sub-task-level language instructions were then generated using the Qwen3-VL-235B-A22B model, creating synchronized sequences of images, instructions, and action trajectories. Notably, the evaluation criteria deliberately included a significant portion of novel actions, with roughly half of the test set's atomic actions not appearing in the top 100 frequent training actions, thereby ensuring robust cross-task generalization rather than mere memorization.
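
For readers who want a concrete picture of what one such synchronized record might look like, the sketch below shows a plausible layout for a single training sample. The field names and shapes are illustrative assumptions, not the project's actual data schema; only the general structure (multi-view RGB-D frames, two levels of instruction, robot state, and a future action chunk) follows the description above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BimanualSample:
    """One synchronized training record (illustrative layout, not the official schema)."""
    rgb_views: np.ndarray        # (num_cameras, H, W, 3) multi-view RGB frames
    depth_views: np.ndarray      # (num_cameras, H, W) aligned depth maps
    task_instruction: str        # task-level language instruction
    subtask_instruction: str     # sub-task-level instruction for the current clip
    proprio_state: np.ndarray    # joint and gripper state for both arms
    action_chunk: np.ndarray     # (horizon, action_dim) future teleoperated actions

# Example: a dummy sample for a dual-arm robot with 3 cameras and a 50-step action chunk
sample = BimanualSample(
    rgb_views=np.zeros((3, 224, 224, 3), dtype=np.uint8),
    depth_views=np.zeros((3, 224, 224), dtype=np.float32),
    task_instruction="Fold the towel and place it in the basket.",
    subtask_instruction="Grasp the left corner of the towel.",
    proprio_state=np.zeros(16, dtype=np.float32),   # e.g. 7 DoF + gripper per arm
    action_chunk=np.zeros((50, 16), dtype=np.float32),
)
```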

Innovative Architecture and Action Representation

LingBot-VLA's architecture incorporates a sophisticated Mixture of Transformers, combining a robust multimodal backbone with an action expert. The visual language backbone, Qwen2.5-VL, processes multi-view operational images and natural language instructions into multimodal tokens. Simultaneously, the action expert receives proprioceptive robot state information and segments of past actions. Both components interact through a shared self-attention module, facilitating layer-wise joint sequence modeling across observation and action tokens.
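
For intuition, the snippet below sketches how one such layer-wise joint block could be wired up in PyTorch: each stream keeps its own feed-forward ("expert") weights, but observation and action tokens attend to each other through a single shared self-attention over the concatenated sequence. Module names, dimensions, and the pre-norm layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class JointLayer(nn.Module):
    """One layer of a mixture-of-transformers block (illustrative sketch)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.obs_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.act_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, obs_tokens: torch.Tensor, act_tokens: torch.Tensor):
        n_obs = obs_tokens.shape[1]
        joint = torch.cat([obs_tokens, act_tokens], dim=1)
        # single shared self-attention across observation and action tokens
        q = self.norm1(joint)
        attn_out, _ = self.shared_attn(q, q, q)
        joint = joint + attn_out
        obs, act = joint[:, :n_obs], joint[:, n_obs:]
        # stream-specific feed-forward weights after the shared attention
        obs = obs + self.obs_mlp(self.norm2(obs))
        act = act + self.act_mlp(self.norm2(act))
        return obs, act

# Example: 2 samples, 300 observation tokens, 50 action tokens
layer = JointLayer()
obs, act = layer(torch.randn(2, 300, 512), torch.randn(2, 50, 512))
```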

The model constructs an observation sequence at each timestep by concatenating tokens from camera views, task instructions, and robot state. The action sequence comprises a future action chunk, with a temporal horizon set to 50 during pre-training. Conditional Flow Matching serves as the training objective, enabling the model to learn a vector field for transforming Gaussian noise into precise, ground-truth action trajectories. This yields a continuous action representation, vital for generating smooth and temporally coherent control necessary for accurate dual-arm manipulation.
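
As a rough illustration of the training objective, the snippet below sketches a generic conditional-flow-matching step under common conventions: a linear path between Gaussian noise and the ground-truth action chunk, with a mean-squared error on the predicted velocity. It is not LingBot-VLA's exact formulation, and the policy interface is a placeholder.

```python
import torch

def flow_matching_loss(policy, obs_tokens, actions):
    """Generic conditional-flow-matching step (illustrative, not the official code).

    actions: (batch, horizon, action_dim) ground-truth action chunk.
    policy(obs_tokens, noisy_actions, t) is assumed to predict a velocity field.
    """
    batch = actions.shape[0]
    t = torch.rand(batch, 1, 1, device=actions.device)   # interpolation time in [0, 1]
    noise = torch.randn_like(actions)                     # Gaussian source sample
    noisy = (1.0 - t) * noise + t * actions               # linear probability path
    target_velocity = actions - noise                     # velocity of that path
    pred_velocity = policy(obs_tokens, noisy, t)
    return torch.mean((pred_velocity - target_velocity) ** 2)

# Dummy usage with a stand-in policy (the real model is omitted for brevity)
dummy_policy = lambda obs, noisy, t: torch.zeros_like(noisy)
loss = flow_matching_loss(dummy_policy, obs_tokens=None, actions=torch.randn(4, 50, 16))
```

At inference time, the learned vector field is integrated from Gaussian noise (for example with a few Euler steps) to produce the next action chunk, which is what yields the smooth, temporally coherent control described above.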

Enhanced Spatial Perception with LingBot-Depth

A key enhancement in LingBot-VLA is its integration of LingBot-Depth, a separate spatial perception model, to overcome a common VLA limitation: weak depth reasoning, particularly when sensor data is sparse or incomplete. LingBot-Depth is trained in a self-supervised manner on an extensive RGB-D corpus, learning to reconstruct dense metric depth from partially masked depth maps, a crucial capability in challenging environments.
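
The masked-reconstruction idea can be sketched generically: random regions of the input depth map are hidden, and the model is penalized for its error on exactly those regions. The function below is an illustrative stand-in; the interface and the L1 loss choice are assumptions, not the published recipe.

```python
import torch

def masked_depth_loss(depth_model, rgb, depth, mask_ratio=0.5):
    """Self-supervised masked depth reconstruction (generic sketch).

    rgb:   (batch, 3, H, W) image
    depth: (batch, 1, H, W) metric depth, possibly with sensor holes (zeros)
    depth_model(rgb, masked_depth) is assumed to return a dense depth prediction.
    """
    mask = (torch.rand_like(depth) < mask_ratio).float()   # 1 = hidden from the model
    masked_depth = depth * (1.0 - mask)
    pred = depth_model(rgb, masked_depth)
    # penalize error only where depth was hidden and ground truth is valid
    weight = mask * (depth > 0).float()
    return (weight * (pred - depth).abs()).sum() / weight.sum().clamp(min=1.0)
```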

Within LingBot-VLA, visual queries from each camera view are aligned with LingBot-Depth tokens via a projection layer and a distillation loss. This process effectively embeds geometry-aware information into the policy, substantially improving performance on tasks demanding accurate 3D spatial reasoning, such as precise insertion, stacking, and folding operations amidst clutter and occlusion.
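
One common way to implement such an alignment is to project the policy's visual query features and regress them toward the frozen depth model's tokens. The module below illustrates that pattern; the cosine-style loss and dimensions are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDistillationHead(nn.Module):
    """Project policy visual queries and align them with depth-model tokens (sketch)."""
    def __init__(self, policy_dim: int = 512, depth_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(policy_dim, depth_dim)

    def forward(self, visual_queries: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        # visual_queries: (batch, num_tokens, policy_dim) from the VLA's camera views
        # depth_tokens:   (batch, num_tokens, depth_dim) from the frozen LingBot-Depth encoder
        projected = self.proj(visual_queries)
        # negative cosine similarity as the distillation loss (illustrative choice)
        return 1.0 - F.cosine_similarity(projected, depth_tokens.detach(), dim=-1).mean()

# Example: 196 visual query tokens per view aligned to depth tokens of the same length
head = DepthDistillationHead()
loss = head(torch.randn(2, 196, 512), torch.randn(2, 196, 768))
```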

Benchmark-Setting Performance and Data Efficiency

LingBot-VLA underwent rigorous evaluation on the GM-100 real-world benchmark, which comprises 100 manipulation tasks across three hardware platforms. Under a standardized post-training protocol, LingBot-VLA with its depth integration achieved a 17.30% average Success Rate (SR) and a 35.41% average Progress Score (PS). These figures represent state-of-the-art performance, surpassing competitors such as π0.5 (13.02% SR, 27.65% PS), GR00T N1.6 (7.59% SR, 15.99% PS), and WALL-OSS (4.05% SR, 10.35% PS). Even without the depth component, LingBot-VLA outperformed GR00T N1.6 and WALL-OSS.

Further analysis revealed compelling scaling behavior, with performance consistently improving as pre-training data volume increased, showing no saturation up to 20,000 hours. The model also exhibited remarkable data efficiency during post-training; on the AgiBot G1 platform, LingBot-VLA achieved superior results with only 80 demonstrations per task, outperforming π0.5 which utilized a full set of 130 demonstrations. This capability significantly reduces the adaptation cost for new robotic systems or tasks.

Optimized Training and Open-Source Availability

The LingBot-VLA framework includes an optimized training stack designed for multi-node efficiency. The codebase leverages strategies such as FSDP for parameters and optimizer states, hybrid sharding for the action expert, and mixed precision with bfloat16 storage. These optimizations yield a throughput of 261 samples per second per GPU for specific model configurations, a 1.5 to 2.8 times speedup over existing VLA-oriented codebases, and throughput scales nearly linearly from 8 to 256 GPUs. The comprehensive post-training toolkit has been released as open source, promoting broader adoption and further innovation in the robotics community.
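
To give a rough sense of what such a stack involves in PyTorch terms, the sketch below wraps a backbone and an action expert with fully sharded data parallelism, hybrid sharding, and bfloat16 mixed precision. It is a generic configuration illustrating the ingredients named above, not the released LingBot-VLA codebase.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_for_training(backbone: nn.Module, action_expert: nn.Module):
    """Wrap the two sub-models with FSDP and bfloat16 mixed precision (generic sketch)."""
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,    # parameters stored and computed in bfloat16
        reduce_dtype=torch.bfloat16,   # gradient reduction in bfloat16
        buffer_dtype=torch.bfloat16,
    )
    # Fully shard the large vision-language backbone (parameters and optimizer
    # state) across every GPU in the job.
    backbone = FSDP(
        backbone,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        mixed_precision=bf16,
        device_id=torch.cuda.current_device(),
    )
    # Hybrid-shard the smaller action expert: shard within a node, replicate
    # across nodes, keeping its communication local and cheap.
    action_expert = FSDP(
        action_expert,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        mixed_precision=bf16,
        device_id=torch.cuda.current_device(),
    )
    return backbone, action_expert

if __name__ == "__main__":
    # Typically launched with: torchrun --nnodes=N --nproc_per_node=8 train.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    backbone, expert = wrap_for_training(
        nn.Linear(1024, 1024).cuda(), nn.Linear(512, 512).cuda()
    )
```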

This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.

Source: MarkTechPost