
How Radial Attention Cuts Costs in Video Diffusion by 4.4× Without Sacrificing Quality

Introduction to Video Diffusion Models and Computational Challenges

Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in videos significantly increases computational demands, especially since self-attention scales poorly with sequence length. This makes it difficult to train or run these…
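As a back-of-the-envelope illustration of that scaling pressure (the token counts below are assumptions for illustration, not figures from the article): self-attention cost grows with the square of the sequence length, and video multiplies that length by the number of frames.

```python
def attention_flops(seq_len: int, head_dim: int = 128) -> int:
    """Rough FLOPs for the QK^T and attention-times-V matmuls of one attention head."""
    return 2 * 2 * seq_len * seq_len * head_dim  # two (seq_len x seq_len x head_dim) matmuls

tokens_per_frame = 1024  # assumed spatial tokens per latent frame
for frames in (1, 16, 64):
    seq_len = frames * tokens_per_frame
    print(f"{frames:>3} frames -> {seq_len:>6} tokens -> {attention_flops(seq_len):.2e} FLOPs")
```

Going from 16 to 64 frames quadruples the token count but raises the attention cost roughly sixteen-fold, which is the pressure that more efficient attention schemes such as Radial Attention aim to relieve.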

Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video Control

Key Takeaways: Researchers from Google DeepMind, the University of Michigan, and Brown University have developed “Motion Prompting,” a new method for controlling video generation using specific motion trajectories. The technique uses “motion prompts,” a flexible representation of movement that can be either sparse or dense, to guide a pre-trained video diffusion model. A key innovation…
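To make the sparse-versus-dense distinction concrete, a motion prompt can be pictured as a collection of point trajectories, each a list of (frame, x, y) positions. The sketch below is an illustrative data shape under that assumption, not the paper's exact format:

```python
from dataclasses import dataclass

@dataclass
class Track:
    points: list[tuple[int, float, float]]  # (frame_index, x, y) for one tracked point

frames = 16

# Sparse prompt: a handful of user-drawn trajectories, e.g. dragging one object to the right.
sparse_prompt = [Track([(t, 100.0 + 2 * t, 200.0) for t in range(frames)])]

# Dense prompt: one trajectory per cell of a coarse grid over the frame.
dense_prompt = [Track([(t, float(x), float(y)) for t in range(frames)])
                for x in range(0, 256, 16) for y in range(0, 256, 16)]

print(len(sparse_prompt), "vs", len(dense_prompt), "trajectories")
```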

ByteDance Researchers Introduce VGR: A Novel Reasoning Multimodal Large Language Model (MLLM) with Enhanced Fine-Grained Visual Perception Capabilities

Why Multimodal Reasoning Matters for Vision-Language Tasks

Multimodal reasoning enables models to make informed decisions and answer questions by combining both visual and textual information. This type of reasoning plays a central role in interpreting charts, answering image-based questions, and understanding complex visual documents. The goal is to make machines capable of using vision as…

BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI

Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the new architecture unifies text-to-image generation, image editing, and subject-driven generation within a single transformer framework. It innovates by decoupling the modeling of text and image generation, incorporating a reflective training mechanism, and implementing a purpose-built…

EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments

Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. The towering skyscrapers block and reflect satellite signals, leading to location errors of tens of meters. For you and me, that might mean a missed turn. But for an autonomous vehicle or a delivery robot,…

Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhancing Text-to-Image (T2I) Model Quality

Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistent output quality — both in aesthetic and alignment terms — remains a persistent challenge. While large-scale pretraining provides general knowledge, it is insufficient to achieve high aesthetic quality and alignment. Supervised…

National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation

In recent months, there has been growing interest in applying diffusion models—originally designed for continuous data, such as images—to natural language processing tasks. This has led to the development of Discrete Diffusion Language Models (DLMs), which treat text generation as a denoising process. Unlike traditional autoregressive models, DLMs enable parallel decoding and provide better control…
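A toy version of that denoising view: start from a fully masked sequence, predict every masked position in parallel, keep the most confident predictions, and repeat. The predict_masked function below is a hypothetical stand-in for the model, and the loop is a generic masked-diffusion sketch rather than Dimple's specific procedure:

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def predict_masked(seq):
    """Hypothetical stand-in for the model: (token, confidence) for every masked position."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def generate(length=8, steps=4):
    seq = [MASK] * length                       # fully "noised" start: every position masked
    for step in range(steps):
        preds = predict_masked(seq)             # all masked slots are predicted in parallel
        if not preds:
            break
        # Commit a growing fraction of the most confident predictions; re-mask the rest.
        n_keep = max(1, len(preds) * (step + 1) // steps)
        for i in sorted(preds, key=lambda i: preds[i][1], reverse=True)[:n_keep]:
            seq[i] = preds[i][0]
    return seq

print(generate())
```

Because each step fills many positions at once instead of one token at a time, decoding takes a handful of passes rather than one pass per token, which is where the parallelism over autoregressive models comes from.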

Samsung Researchers Introduced ANSE (Active Noise Selection for Generation): A Model-Aware Framework for Improving Text-to-Video Diffusion Models through Attention-Based Uncertainty Estimation

Video generation models have become a core technology for creating dynamic content by transforming text prompts into high-quality video sequences. Diffusion models, in particular, have established themselves as a leading approach for this task. These models work by starting from random noise and iteratively refining it into realistic video frames. Text-to-video (T2V) models extend this…
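In code, that refine-from-noise loop is only a few lines. The snippet below is a generic sampling skeleton with a hypothetical denoise_step, plus a placeholder for the kind of initial-noise selection ANSE performs; the actual attention-based uncertainty score is not reproduced here:

```python
import numpy as np

def denoise_step(x, t):
    """Hypothetical stand-in for one model-guided refinement step (text conditioning omitted)."""
    return 0.9 * x

def sample_video(seed, shape=(16, 3, 64, 64), steps=50):
    x = np.random.default_rng(seed).standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):                         # iteratively refine toward video frames
        x = denoise_step(x, t)
    return x

def pick_seed(candidates=8):
    """Placeholder for ANSE-style selection: score each candidate noise and keep the best."""
    def score(seed):  # stands in for the attention-based uncertainty estimate
        return float(np.abs(np.random.default_rng(seed).standard_normal(1000)).mean())
    return min(range(candidates), key=score)

video = sample_video(pick_seed())
print(video.shape)
```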

This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. A major…

Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single Images

Manipulating lighting conditions in images post-capture is challenging. Traditional approaches rely on 3D graphics methods that reconstruct scene geometry and properties from multiple captures before simulating new lighting using physical illumination models. Though these techniques provide explicit control over light sources, recovering accurate 3D models from single images remains a problem that frequently results in…
