Estimated reading time: 5 minutes
Introduction
Embodied AI agents are increasingly being called upon to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from Nvidia and National Taiwan University, offers a breakthrough for vision-language-action (VLA) reasoning, introducing reinforced visual latent planning to bridge high-level multimodal reasoning…
Micromobility solutions—such as delivery robots, mobility scooters, and electric wheelchairs—are rapidly transforming short-distance urban travel. Despite their growing popularity as flexible, eco-friendly transport alternatives, most micromobility devices still rely heavily on human control. This dependence limits operational efficiency and raises safety concerns, especially in complex, crowded city environments filled with dynamic obstacles like pedestrians and…
Google DeepMind has unveiled Gemini Robotics On-Device, a compact, local version of its powerful vision-language-action (VLA) model, bringing advanced robotic intelligence directly onto devices. This marks a key step forward in the field of embodied AI by eliminating the need for continuous cloud connectivity while maintaining the flexibility, generality, and high precision associated with the…
Challenges in Dexterous Hand Manipulation Data Collection
Creating large-scale data for dexterous hand manipulation remains a major challenge in robotics. Although hands offer greater flexibility and richer manipulation potential than simpler tools, such as grippers, their complexity makes them difficult to control effectively. Many in the field have questioned whether dexterous hands are worth the…
The Challenge of Scaling 3D Environments in Embodied AI
Creating realistic and accurately scaled 3D environments is essential for training and evaluating embodied AI. However, current methods still rely on manually designed 3D graphics, which are costly and lack realism, thereby limiting scalability and generalization. Unlike internet-scale data used in models like GPT and CLIP,…
Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield…
Bridging Perception and Action in Robotics
Multimodal Large Language Models (MLLMs) hold promise for enabling machines, such as robotic arms and legged robots, to perceive their surroundings, interpret scenarios, and take meaningful actions. The integration of such intelligence into physical systems is advancing the field of robotics, pushing it toward autonomous machines that don’t just…
AI has advanced in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains challenging. Physical AI seeks to close this gap by developing systems that perceive, understand, and act in dynamic, real-world settings. Unlike conventional AI that processes text or symbols, Physical AI engages with sensory inputs, especially video, and…
Designing intelligent systems that function reliably in dynamic physical environments remains one of the more difficult frontiers in AI. While significant advances have been made in perception and planning within simulated or controlled contexts, the real world is noisy, unpredictable, and resistant to abstraction. Traditional AI systems often rely on high-level representations detached from their…
Marine robotic platforms support various applications, including marine exploration, underwater infrastructure inspection, and ocean environment monitoring. While reliable perception systems enable robots to sense their surroundings, detect objects, and navigate complex underwater terrains independently, developing these systems presents unique difficulties compared to their terrestrial counterparts. Collecting real-world underwater data requires complex hardware, controlled experimental setups,…