Awesome-Multimodal-Reasoning

This is a repository for organizing papers related to Multimodal Reasoning in Multimodal Large Language Models (Image, Video).

As the visual and audio capabilities and the RL-powered reasoning capabilities of multimodal large language models (MLLMs/LVLMs/LSLMs) advance, researchers have high hopes for the multimodal reasoning capabilities of these models.

This repo also collects papers on visual generation (image generation/video generation) with RL/CoT.

This repository aims to cover ALL possibly relevant papers to support research surveys, not just selected best papers. We believe comprehensive coverage is more valuable for researchers than a curated list.

⭐ If you find this list useful, please star it!

Paper List (Updating...)

Survey

(8 May 2025) Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

(30 Apr 2025) Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models

(4 Apr 2025) Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning

(18 Mar 2025) Aligning Multimodal LLM with Human Preference: A Survey

(16 Mar 2025) Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Image Reasoning

(13 Apr 2026) POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

(13 Apr 2026) CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

(13 Apr 2026) Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

(13 Apr 2026) Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate

(13 Apr 2026) Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

(12 Apr 2026) A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

(10 Apr 2026) Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

(10 Apr 2026) VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

(10 Apr 2026) Visually-Guided Policy Optimization for Multimodal Reasoning

(10 Apr 2026) ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

(09 Apr 2026) Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

(09 Apr 2026) TeamLLM: Exploring the Capabilities of LLMs for Multimodal Group Interaction Prediction

(09 Apr 2026) OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

(09 Apr 2026) Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

(09 Apr 2026) WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

(09 Apr 2026) AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models

(09 Apr 2026) RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

(09 Apr 2026) MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

(08 Apr 2026) Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

(08 Apr 2026) Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

(07 Apr 2026) MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

(07 Apr 2026) WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

(07 Apr 2026) Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

(07 Apr 2026) Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

(06 Apr 2026) Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

(06 Apr 2026) Vero: An Open RL Recipe for General Visual Reasoning

(05 Apr 2026) Belief-Aware VLM Model for Human-like Reasoning

(03 Apr 2026) Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

(03 Apr 2026) Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

(03 Apr 2026) CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

(03 Apr 2026) FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

(02 Apr 2026) Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs

(02 Apr 2026) Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

(02 Apr 2026) Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

(02 Apr 2026) MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction

(01 Apr 2026) EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

(01 Apr 2026) All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

(30 Mar 2026) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

(30 Mar 2026) MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

(30 Mar 2026) $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning

(29 Mar 2026) MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences

(29 Mar 2026) LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

(29 Mar 2026) Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

(28 Mar 2026) Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models

(28 Mar 2026) Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

(27 Mar 2026) Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

(27 Mar 2026) Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives

(26 Mar 2026) R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

(26 Mar 2026) Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

(24 Mar 2026) MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

(24 Mar 2026) Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

(24 Mar 2026) GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

(23 Mar 2026) Getting to the Point: Why Pointing Improves LVLMs

(23 Mar 2026) Rethinking Token Reduction for Large Vision-Language Models

(22 Mar 2026) RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

(21 Mar 2026) Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning

(20 Mar 2026) One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment

(19 Mar 2026) Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

(19 Mar 2026) TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

(17 Mar 2026) HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

(17 Mar 2026) PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

(17 Mar 2026) DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

(17 Mar 2026) The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

(17 Mar 2026) OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials

(17 Mar 2026) Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

(16 Mar 2026) MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

(16 Mar 2026) Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

(16 Mar 2026) From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

(15 Mar 2026) On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs

(14 Mar 2026) Improving Visual Reasoning with Iterative Evidence Refinement

(12 Mar 2026) Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

(10 Mar 2026) C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

(10 Mar 2026) MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

(10 Mar 2026) GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

(09 Mar 2026) MJ1: Multimodal Judgment via Grounded Verification

(08 Mar 2026) Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models

(08 Mar 2026) Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

(07 Mar 2026) Perception-Aware Multimodal Spatial Reasoning from Monocular Images

(06 Mar 2026) MLLMRec-R1: Incentivizing Reasoning Capability in Large Language Models for Multimodal Sequential Recommendation

(06 Mar 2026) PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

(06 Mar 2026) ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

(05 Mar 2026) Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

(04 Mar 2026) Phi-4-reasoning-vision-15B Technical Report

(04 Mar 2026) Discriminative Perception via Anchored Description for Reasoning Segmentation

(03 Mar 2026) TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

(03 Mar 2026) SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

(03 Mar 2026) Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling

(02 Mar 2026) Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

(01 Mar 2026) Can Thinking Models Think to Detect Hateful Memes?

(01 Mar 2026) ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

(01 Mar 2026) DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

(01 Mar 2026) ICPRL: Acquiring Physical Intuition from Interactive Control

(28 Feb 2026) PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

(27 Feb 2026) Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

(27 Feb 2026) EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

(27 Feb 2026) Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

(27 Feb 2026) Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

(27 Feb 2026) Reasoning-Driven Multimodal LLM for Domain Generalization

(26 Feb 2026) MediX-R1: Open Ended Medical Reinforcement Learning

(26 Feb 2026) FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

(26 Feb 2026) CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

(25 Feb 2026) MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

(25 Feb 2026) See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

(25 Feb 2026) RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

(24 Feb 2026) HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

(24 Feb 2026) Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

(23 Feb 2026) TextShield-R1: Reinforced Reasoning for Tampered Text Detection

(23 Feb 2026) Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

(19 Feb 2026) RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

(19 Feb 2026) Enabling Training-Free Text-Based Remote Sensing Segmentation

(18 Feb 2026) Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning

(18 Feb 2026) Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI

(17 Feb 2026) Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

(17 Feb 2026) On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

(16 Feb 2026) Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning

(16 Feb 2026) TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning

(16 Feb 2026) Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response

(15 Feb 2026) ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

(15 Feb 2026) GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

(14 Feb 2026) Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

(14 Feb 2026) Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

(13 Feb 2026) Reliable Thinking with Images

(13 Feb 2026) Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

(12 Feb 2026) Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning

(11 Feb 2026) Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

(10 Feb 2026) Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

(10 Feb 2026) Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

(10 Feb 2026) GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation

(09 Feb 2026) AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

(09 Feb 2026) Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

(09 Feb 2026) When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

(07 Feb 2026) VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

(07 Feb 2026) Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

(06 Feb 2026) MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

(06 Feb 2026) SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

(05 Feb 2026) V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

(05 Feb 2026) Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

(05 Feb 2026) Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

(04 Feb 2026) Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

(04 Feb 2026) Training Data Efficiency in Multimodal Process Reward Models

(03 Feb 2026) Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

(02 Feb 2026) Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

(02 Feb 2026) VLM-Guided Experience Replay

(02 Feb 2026) ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning

(02 Feb 2026) Multimodal Large Language Models for Real-Time Situated Reasoning

(01 Feb 2026) StreamVLA: Breaking the Reason-Act Cycle via Completion-State Gating

(01 Feb 2026) Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

(01 Feb 2026) SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning

(31 Jan 2026) RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

(29 Jan 2026) MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

(28 Jan 2026) MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

(27 Jan 2026) Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

(26 Jan 2026) AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

(22 Jan 2026) Explainable Deepfake Detection with RL Enhanced Self-Blended Images

(20 Jan 2026) Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

(19 Jan 2026) Think3D: Thinking with Space for Spatial Reasoning

(16 Jan 2026) ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

(16 Jan 2026) MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement

(12 Jan 2026) CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

(12 Jan 2026) MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

(11 Jan 2026) E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis

(11 Jan 2026) Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

(09 Jan 2026) SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

(08 Jan 2026) Agri-R1: Agricultural Reasoning for Disease Diagnosis via Automated-Synthesis and Reinforcement Learning

(06 Jan 2026) ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

(06 Jan 2026) Towards Faithful Reasoning in Comics for Small MLLMs

(06 Jan 2026) Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

(01 Jan 2026) CPPO: Contrastive Perception for Vision Language Policy Optimization

(01 Jan 2026) From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

(31 Dec 2025) From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme

(30 Dec 2025) SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

(22 Dec 2025) Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

(22 Dec 2025) CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning

(22 Dec 2025) Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation

(22 Dec 2025) SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models

(21 Dec 2025) ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning

(21 Dec 2025) Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback

(20 Dec 2025) Stable and Efficient Single-Rollout RL for Multimodal Reasoning

(19 Dec 2025) Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

(19 Dec 2025) Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

(18 Dec 2025) AdaTooler-V: Adaptive Tool-Use for Images and Videos

(16 Dec 2025) Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

(16 Dec 2025) ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

(16 Dec 2025) OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

(15 Dec 2025) AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

(13 Dec 2025) More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

(13 Dec 2025) Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

(12 Dec 2025) DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

(09 Dec 2025) Thinking with Images via Self-Calling Agent

(08 Dec 2025) MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

(07 Dec 2025) Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

(07 Dec 2025) The Role of Entropy in Visual Grounding: Analysis and Optimization

(06 Dec 2025) ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models

(06 Dec 2025) VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

(03 Dec 2025) TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

(03 Dec 2025) Thinking with Programming Vision: Towards a Unified View for Thinking with Images

(03 Dec 2025) Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

(03 Dec 2025) Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning

(02 Dec 2025) See, Think, Learn: A Self-Taught Multimodal Reasoner

(29 Nov 2025) ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

(28 Nov 2025) TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM

(27 Nov 2025) GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes

(26 Nov 2025) OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

(25 Nov 2025) VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis

(24 Nov 2025) Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

(23 Nov 2025) Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

(22 Nov 2025) PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

(21 Nov 2025) VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

(21 Nov 2025) ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

(20 Nov 2025) OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

(19 Nov 2025) VisPlay: Self-Evolving Vision-Language Models from Images

(17 Nov 2025) Video Finetuning Improves Reasoning Between Frames

(17 Nov 2025) From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

(17 Nov 2025) SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization

(14 Nov 2025) Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions

(13 Nov 2025) AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

(12 Nov 2025) History-Aware Reasoning for GUI Agents

(11 Nov 2025) From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

(10 Nov 2025) Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

(07 Nov 2025) PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

(04 Nov 2025) ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

(04 Nov 2025) SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

(01 Nov 2025) UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

(01 Nov 2025) Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning

(31 Oct 2025) GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

(31 Oct 2025) Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

(23 Oct 2025) Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

(23 Oct 2025) Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

(18 Oct 2025) RL makes MLLMs see better than SFT

(16 Oct 2025) MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

(15 Oct 2025) Generative Universal Verifier as Multimodal Meta-Reasoner

(14 Oct 2025) HoneyBee: Data Recipes for Vision-Language Reasoners

(14 Oct 2025) DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

(10 Oct 2025) Unleashing Perception-Time Scaling to Multimodal Reasoning Models

(10 Oct 2025) Spotlight on Token Perception for Multimodal Reinforcement Learning

(10 Oct 2025) Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging

(13 Oct 2025) CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

(9 Oct 2025) ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

(9 Oct 2025) SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

(7 Oct 2025) Context Matters: Learning Global Semantics via Object-Centric Representation

(6 Oct 2025) Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

(3 Oct 2025) Efficient Test-Time Scaling for Small Vision-Language Models

(27 Sep 2025) Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning

(29 Sep 2025) Latent Visual Reasoning

(29 Sep 2025) GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

(28 Sep 2025) Poivre: Self-Refining Visual Pointing with Reinforcement Learning

(29 Sep 2025) VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding

(29 Sep 2025) Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

(25 Sep 2025) MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

(12 Sep 2025) LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

(9 Sep 2025) Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

(28 Aug 2025) R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

(27 Aug 2025) Self-Rewarding Vision-Language Model via Reasoning Decomposition

(18 Aug 2025) M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

(18 Aug 2025) Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation

(18 Aug 2025) Ovis2.5 Technical Report

(18 Aug 2025) MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

(8 Aug 2025) SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

(7 Aug 2025) Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

(7 Aug 2025) StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models

(5 Aug 2025) Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions

(30 Jul 2025) MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

(28 Jul 2025) Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

(24 Jul 2025) MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

(24 Jul 2025) SafeWork-R1: Coevolving Safety and Intelligence under the AI-45 Law

(22 Jul 2025) C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

(22 Jul 2025) Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

(11 Jul 2025) M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

(3 Jul 2025) Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

(1 Jul 2025) GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

(20 Jun 2025) GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

(16 Jun 2025) Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning

(11 Jun 2025) ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

(5 Jun 2025) Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

(5 Jun 2025) Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

(5 Jun 2025) MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

(16 May 2025) Visual Planning: Let's Think Only with Images

(15 May 2025) MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

(13 May 2025) OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

(12 May 2025) Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

(8 May 2025) Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

(8 May 2025) SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

(6 May 2025) X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

(6 May 2025) Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

(6 May 2025) ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

(5 May 2025) R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

(28 Apr 2025) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

(25 Apr 2025) Fast-Slow Thinking for Large Vision-Language Model Reasoning

(25 Apr 2025) Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

(25 Apr 2025) Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

(21 Apr 2025) A Call for New Recipes to Enhance Spatial Reasoning in MLLMs

(20 Apr 2025) Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

(12 Apr 2025) VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

(10 Apr 2025) VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

(10 Apr 2025) SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

(10 Apr 2025) Perception-R1: Pioneering Perception Policy with Reinforcement Learning

(10 Apr 2025) Kimi-VL Technical Report

(8 Apr 2025) On the Suitability of Reinforcement Fine-Tuning to Visual Tasks

(8 Apr 2025) Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

(1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training

(17 Mar 2025) R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

(13 Mar 2025) VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

(9 Mar 2025) Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

(7 Mar 2025) R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

(7 Mar 2025) Unified Reward Model for Multimodal Understanding and Generation

(7 Mar 2025) R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

(3 Mar 2025) Visual-RFT: Visual Reinforcement Fine-Tuning

(4 Feb 2025) Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

(3 Jan 2025) Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

(13 Jan 2025) Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

(10 Jan 2025) LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

(9 Jan 2025) Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

(30 Dec 2024) Slow Perception: Let's Perceive Geometric Figures Step-by-step

(19 Dec 2024) Progressive Multimodal Reasoning via Active Retrieval

(29 Nov 2024) Interleaved-Modal Chain-of-Thought

(15 Nov 2024) Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

(15 Nov 2024) LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

(30 Oct 2024) Vision-Language Models Can Self-Improve Reasoning via Reflection

(23 Oct 2024) R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning

(11 Oct 2024) M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought

(6 Oct 2024) MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration

(4 Oct 2024) Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

(29 Sep 2024) CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

(13 Jun 2024) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

(28 Dec 2023) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

(14 Dec 2023) Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

(27 Nov 2023) Compositional Chain-of-Thought Prompting for Large Multimodal Models

(15 Nov 2023) The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

(3 May 2023) Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

(16 Apr 2023) Chain of Thought Prompt Tuning in Vision Language Models

(2 Feb 2023) Multimodal Chain-of-Thought Reasoning in Language Models

Video

(13 Apr 2026) Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

(30 Mar 2026) SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

(28 Mar 2026) Incentivizing Temporal-Awareness in Egocentric Video Understanding Models

(27 Mar 2026) Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

(26 Mar 2026) VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

(26 Mar 2026) Reinforcing Structured Chain-of-Thought for Video Understanding

(24 Mar 2026) EVA: Efficient Reinforcement Learning for End-to-End Video Agent

(17 Mar 2026) When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

(12 Mar 2026) Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

(19 Feb 2026) GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

(12 Feb 2026) STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

(30 Jan 2026) Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

(28 Jan 2026) Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

(27 Jan 2026) Video-KTR: Reinforcing Video Reasoning via Key Token Attribution

(08 Jan 2026) VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

(07 Dec 2025) MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

(02 Dec 2025) OneThinker: All-in-one Reasoning Model for Image and Video

(28 Nov 2025) Video-CoM: Interactive Video Reasoning via Chain of Manipulations

(24 Nov 2025) VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

(17 Nov 2025) ViSS-R1: Self-Supervised Reinforcement Video Reasoning

(17 Nov 2025) DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

(23 Oct 2025) Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

(9 Oct 2025) SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

(6 Oct 2025) Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

(5 Oct 2025) Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

(29 Sep 2025) FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

(29 Sep 2025) LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

(28 Sep 2025) FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning

(12 Jun 2025) CogStream: Context-guided Streaming Video Question Answering

(6 Jun 2025) VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

(27 Mar 2025) Video-R1: Reinforcing Video Reasoning in MLLMs

(17 Feb 2025) video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

(10 Feb 2025) CoS: Chain-of-Shot Prompting for Long Video Understanding

(8 Jan 2025) Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

(3 Dec 2024) VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

(2 Dec 2024) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

(29 Nov 2024) STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning

(12 Oct 2024) Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning

(27 Sep 2024) Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

(28 Aug 2024) Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

(24 May 2024) Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models

(7 May 2024) Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

(8 Oct 2024) Temporal Reasoning Transfer from Text to Video

DLLM

(07 Apr 2026) Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

(23 Mar 2026) Tiny Inference-Time Scaling with Latent Verifiers

(12 Mar 2026) EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

(06 Mar 2026) Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

(12 Feb 2026) Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

(31 Jan 2026) Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings

(29 Dec 2025) ThinkGen: Generalized Thinking for Visual Generation

(25 Dec 2025) Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration

(03 Dec 2025) ReasonX: MLLM-Guided Intrinsic Image Decomposition

(27 Nov 2025) ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

(9 Oct 2025) Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

(9 Oct 2025) Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

Audio

(13 Apr 2026) Empowering Video Translation using Multimodal Large Language Models

(05 Mar 2026) SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

(25 Jan 2026) AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

(23 Oct 2025) Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

(10 Oct 2025) Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

(8 Oct 2025) Can Speech LLMs Think while Listening?

(5 Oct 2025) Principled and Tractable RL for Reasoning with Diffusion Language Models

(22 Jul 2025) Step-Audio 2 Technical Report

(14 Mar 2025) Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

Image/Video Generation

(15 Apr 2026) Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

(14 Apr 2026) Representation geometry shapes task performance in vision-language modeling for CT enterography

(14 Apr 2026) PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

(10 Mar 2026) Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

(04 Mar 2026) Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

(28 Jan 2026) Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

(26 Jan 2026) GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

(29 Dec 2025) REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

(23 Dec 2025) CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

(14 Dec 2025) Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

(04 Dec 2025) DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

(18 Nov 2025) UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

(13 Nov 2025) Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

(24 Oct 2025) Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation

(15 Oct 2025) Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

(9 Oct 2025) Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

(9 Oct 2025) Reinforcing Diffusion Models by Direct Group Preference Optimization

(9 Oct 2025) Real-Time Motion-Controllable Autoregressive Video Diffusion

(29 Sep 2025) STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

(28 Aug 2025) Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance

(28 Aug 2025) OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

(28 Aug 2025) Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

(27 Aug 2025) CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning

(9 Aug 2025) AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

(28 Jul 2025) Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

(20 Jun 2025) RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

(17 Jun 2025) SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

(16 May 2025) Towards Self-Improvement of Diffusion Models via Group Preference Optimization

(16 May 2025) Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

(15 May 2025) Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

(12 May 2025) DanceGRPO: Unleashing GRPO on Visual Generation

(8 May 2025) Flow-GRPO: Training Flow Matching Models via Online RL

(1 May 2025) T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

(22 Apr 2025) From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

(22 Apr 2025) Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

(26 Mar 2025) MMGen: Unified Multi-modal Image Generation and Understanding in One Go

(13 Mar 2025) GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

(3 Mar 2025) MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

(23 Jan 2025) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Bench/Dataset

(15 Apr 2026) Reward Design for Physical Reasoning in Vision-Language Models

(15 Apr 2026) Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

(14 Apr 2026) Visual Preference Optimization with Rubric Rewards

(13 Apr 2026) Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

(13 Apr 2026) Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding

(09 Apr 2026) Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

(08 Apr 2026) LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

(06 Apr 2026) Rethinking Model Efficiency: Multi-Agent Inference with Large Models

(05 Apr 2026) GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

(05 Apr 2026) Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

(04 Apr 2026) FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

(31 Mar 2026) Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

(29 Mar 2026) Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

(28 Mar 2026) Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

(25 Mar 2026) How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

(25 Mar 2026) NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

(23 Mar 2026) Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages

(23 Mar 2026) Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs

(13 Mar 2026) Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

(13 Mar 2026) Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

(12 Mar 2026) MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

(10 Mar 2026) Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

(10 Mar 2026) OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

(10 Mar 2026) EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

(07 Mar 2026) Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards

(06 Mar 2026) CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

(06 Mar 2026) TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

(05 Mar 2026) Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

(03 Mar 2026) Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

(01 Mar 2026) When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

(28 Feb 2026) ReMoT: Reinforcement Learning with Motion Contrast Triplets

(27 Feb 2026) Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees

(27 Feb 2026) PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

(26 Feb 2026) Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

(25 Feb 2026) When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

(25 Feb 2026) PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

(24 Feb 2026) From Perception to Action: An Interactive Benchmark for Vision Reasoning

(18 Feb 2026) DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

(15 Feb 2026) Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

(13 Feb 2026) On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

(13 Feb 2026) MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

(12 Feb 2026) What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

(11 Feb 2026) TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

(11 Feb 2026) MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

(08 Feb 2026) SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

(05 Feb 2026) M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

(05 Feb 2026) OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

(04 Feb 2026) Reinforced Attention Learning

(01 Feb 2026) Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

(29 Jan 2026) SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

(21 Jan 2026) Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

(19 Jan 2026) CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning

(15 Jan 2026) Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

(13 Jan 2026) M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

(02 Jan 2026) RoboReward: General-Purpose Vision-Language Reward Models for Robotics

(19 Dec 2025) FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

(16 Dec 2025) TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

(15 Dec 2025) MMhops-R1: Multimodal Multi-hop Reasoning

(11 Dec 2025) Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

(11 Dec 2025) Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

(10 Dec 2025) Rethinking Chain-of-Thought Reasoning for Videos

(09 Dec 2025) MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

(03 Dec 2025) Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

(28 Nov 2025) AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

(28 Nov 2025) Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

(26 Nov 2025) Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

(25 Nov 2025) Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

(24 Nov 2025) CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

(21 Nov 2025) MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

(19 Nov 2025) Trustworthy and Fair SkinGPT-R1 for Democratizing Dermatological Reasoning across Diverse Ethnicities

(13 Nov 2025) Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

(12 Nov 2025) Simple Vision-Language Math Reasoning via Rendered Text

(09 Nov 2025) SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

(07 Nov 2025) Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

(03 Nov 2025) TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

(30 Oct 2025) ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

(15 Oct 2025) Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

(14 Oct 2025) Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

(10 Oct 2025) BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception

(10 Oct 2025) SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

(9 Sep 2025) Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

(27 Aug 2025) 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

(8 Aug 2025) MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

(8 Aug 2025) InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?

(22 Jul 2025) ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

(22 Jul 2025) Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

(12 Jun 2025) VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

(12 Jun 2025) MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

(6 Jun 2025) PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

(5 Jun 2025) VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

(5 Jun 2025) MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

(15 May 2025) StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

(13 May 2025) VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

(1 May 2025) MINERVA: Evaluating Complex Video Reasoning

(30 Apr 2025) GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

(21 Apr 2025) IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

(21 Apr 2025) VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

(17 Apr 2025) Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

(16 Apr 2025) FLIP Reasoning Challenge

(14 Apr 2025) VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

(8 Apr 2025) ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

(8 Apr 2025) V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

(8 Apr 2025) MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models

(4 Apr 2025) Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

(15 Feb 2025) SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

(14 Feb 2025) MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

(13 Feb 2025) MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

(18 Dec 2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

(22 Nov 2024) VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

(18 Oct 2024) MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps

(7 Jul 2024) VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool

(20 Jun 2024) MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

(12 Jun 2024) LVBench: An Extreme Long Video Understanding Benchmark

(24 Apr 2024) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

(16 Apr 2024) OpenEQA: Embodied Question Answering in the Era of Foundation Models

(17 Aug 2023) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

(23 May 2023) Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

(18 May 2021) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

Latent

(09 Apr 2026) Multimodal Latent Reasoning via Predictive Embeddings

(02 Apr 2026) PLUME: Latent Reasoning Based Universal Multimodal Embedding

(26 Mar 2026) LanteRn: Latent Visual Structured Reasoning

(23 Mar 2026) Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

(24 Feb 2026) CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

(05 Feb 2026) Multimodal Latent Reasoning via Hierarchical Visual Cues Injection

(04 Feb 2026) Vision-aligned Latent Reasoning for Multi-modal Large Language Model

(28 Dec 2025) ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

(26 Nov 2025) Monet: Reasoning in Latent Visual Space Beyond Images and Language

(22 Nov 2025) L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

(04 Nov 2025) Multimodal Reasoning via Latent Refocusing

(29 Sep 2025) Latent Visual Reasoning

(12 Feb 2025) Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning

(7 Feb 2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

(9 Dec 2024) Training Large Language Models to Reason in a Continuous Latent Space

Open Source Project

https://github.com/Hui-design/Open-LLaVA-Video-R1

https://github.com/SkyworkAI/Skywork-R1V

https://huggingface.co/papers/2503.05379

https://github.com/Osilly/Vision-R1

https://github.com/ModalMinds/MM-EUREKA

https://github.com/OpenRLHF/OpenRLHF-M

https://github.com/Fancy-MLLM/R1-Onevision

https://github.com/om-ai-lab/VLM-R1

https://github.com/EvolvingLMMs-Lab/open-r1-multimodal

https://github.com/Deep-Agent/R1-V

https://github.com/TideDra/lmm-r1

https://github.com/tulerfeng/Video-R1

https://github.com/Wang-Xiaodong1899/Open-R1-Video
