In this tutorial, we explore advanced computer vision techniques using TorchVision’s v2 transforms, modern augmentation strategies, and powerful training enhancements. We walk through the process of building an augmentation pipeline, applying MixUp and CutMix, designing a modern CNN with attention, and implementing a robust training loop. By running everything seamlessly in Google Colab, we position…
Acknowledgements This work was developed by the Gemini Robotics team: Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, Michael Bloesch, Konstantinos Bousmalis, Philemon Brakel, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Christine Chan, Oscar Chang, London Chappellet-Volpini, Jose Enrique…
Can a single AI stack plan like a researcher, reason over scenes, and transfer motions across different robots—without retraining from scratch? Google DeepMind’s Gemini Robotics 1.5 says yes, by splitting embodied intelligence into two models: Gemini Robotics-ER 1.5 for high-level embodied reasoning (spatial understanding, planning, progress/success estimation, tool-use) and Gemini Robotics 1.5 for low-level visuomotor…
Image by Author | Canva
# Introduction
Finding real-world datasets can be challenging because they are often private (protected), incomplete (missing features), or expensive (behind a paywall). Synthetic datasets can solve these problems by letting you generate the data based on your project needs.
Synthetic data is artificially generated information that mimics real-life…
Computer vision moved fast in 2025: new multimodal backbones, larger open datasets, and tighter model–systems integration. Practitioners need sources that publish rigorously, link code and benchmarks, and track deployment patterns—not marketing posts. This list prioritizes primary research hubs, lab blogs, and production-oriented engineering outlets with consistent update cadence. Use it to monitor SOTA shifts, grab…
We’re expanding our risk domains and refining our risk assessment process. AI breakthroughs are transforming our everyday lives, from advancing mathematics, biology and astronomy to realizing the potential of personalized education. As we build increasingly powerful AI models, we’re committed to responsibly developing our technologies and taking an evidence-based approach to staying ahead of emerging…
In this tutorial, we walk step by step through using Hugging Face’s LeRobot library to train and evaluate a behavior-cloning policy on the PushT dataset. We begin by setting up the environment in Google Colab, installing the required dependencies, and loading the dataset through LeRobot’s unified API. We then design a compact visuomotor policy that…
Image by Editor | ChatGPT
# Introduction
Ready for a practical walkthrough with little to no code involved, depending on the approach you choose? This tutorial shows how to tie together two formidable tools — OpenAI's GPT models and the Airtable cloud-based database — to prototype a simple, toy-sized retrieval-augmented generation (RAG) system.…
IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction—tables, code, equations, lists, captions, and reading order—emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and MLX build for Apple Silicon.
What’s new compared to…
Acknowledgements We thank the International Collegiate Programming Contest (ICPC) for their support. This project was a large-scale collaboration, and its success is due to the combined efforts of many individuals and teams. Hanzhao (Maggie) Lin led the overall technical direction for Gemini competitive programming and ICPC 2025 efforts, and co-led with Heng-Tze Cheng on the…