The-Martyr/Awesome-Multimodal-Reasoning

Awesome-Multimodal-Reasoning

This repository organizes papers on multimodal reasoning in multimodal large language models (image and video).

As the visual and audio capabilities and the RL-powered reasoning capabilities of multimodal large language models (MLLMs/LVLMs/LSLMs) advance, researchers have high hopes for their multimodal reasoning.

This repo also collects papers on visual generation (image and video generation) with RL/CoT.

This repository aims to cover ALL possibly relevant papers to support research surveys, not just selected best papers. We believe comprehensive coverage is more valuable for researchers than a curated list.

⭐ If you find this list useful, please consider starring it!

Table of Contents

Paper List (Updating...)

Survey

(8 May 2025) Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models arXiv

(30 Apr 2025) Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning arXiv

(18 Mar 2025) Aligning Multimodal LLM with Human Preference: A Survey arXiv

(16 Mar 2025) Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey arXiv

Image Reasoning

(13 Apr 2026) POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs arXiv

(13 Apr 2026) CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning arXiv

(13 Apr 2026) Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games arXiv

(13 Apr 2026) Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate arXiv

(13 Apr 2026) Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images arXiv

(12 Apr 2026) A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning arXiv

(10 Apr 2026) Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents arXiv

(10 Apr 2026) VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning arXiv

(10 Apr 2026) Visually-Guided Policy Optimization for Multimodal Reasoning arXiv

(10 Apr 2026) ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning arXiv

(09 Apr 2026) Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models arXiv

(09 Apr 2026) TeamLLM: Exploring the Capabilities of LLMs for Multimodal Group Interaction Prediction arXiv

(09 Apr 2026) OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks arXiv

(09 Apr 2026) Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data arXiv

(09 Apr 2026) WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models arXiv

(09 Apr 2026) AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models arXiv

(09 Apr 2026) RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs arXiv

(09 Apr 2026) MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning arXiv

(08 Apr 2026) Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization arXiv

(08 Apr 2026) Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning arXiv

(07 Apr 2026) MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control arXiv

(07 Apr 2026) WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering arXiv

(07 Apr 2026) Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank arXiv

(07 Apr 2026) Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction arXiv

(06 Apr 2026) Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward arXiv

(06 Apr 2026) Vero: An Open RL Recipe for General Visual Reasoning arXiv

(05 Apr 2026) Belief-Aware VLM Model for Human-like Reasoning arXiv

(03 Apr 2026) Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models arXiv

(03 Apr 2026) Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models arXiv

(03 Apr 2026) CharTool: Tool-Integrated Visual Reasoning for Chart Understanding arXiv

(03 Apr 2026) FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation arXiv

(02 Apr 2026) Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs arXiv

(02 Apr 2026) Impact of Multimodal and Conversational AI on Learning Outcomes and Experience arXiv

(02 Apr 2026) Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models arXiv

(02 Apr 2026) MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction arXiv

(01 Apr 2026) EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs arXiv

(01 Apr 2026) All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models arXiv

(30 Mar 2026) Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning arXiv

(30 Mar 2026) MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding arXiv

(30 Mar 2026) $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning arXiv

(29 Mar 2026) MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences arXiv

(29 Mar 2026) LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation arXiv

(29 Mar 2026) Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs arXiv

(28 Mar 2026) Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models arXiv

(28 Mar 2026) Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models arXiv

(27 Mar 2026) Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR arXiv

(27 Mar 2026) Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives arXiv

(26 Mar 2026) R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning arXiv

(26 Mar 2026) Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs arXiv

(24 Mar 2026) MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models arXiv

(24 Mar 2026) Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought arXiv

(24 Mar 2026) GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning arXiv

(23 Mar 2026) Getting to the Point: Why Pointing Improves LVLMs arXiv

(23 Mar 2026) Rethinking Token Reduction for Large Vision-Language Models arXiv

(22 Mar 2026) RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models arXiv

(21 Mar 2026) Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning arXiv

(20 Mar 2026) One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment arXiv

(19 Mar 2026) Balanced Thinking: Improving Chain of Thought Training in Vision Language Models arXiv

(19 Mar 2026) TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation arXiv

(17 Mar 2026) HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning arXiv

(17 Mar 2026) PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning arXiv

(17 Mar 2026) DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models arXiv

(17 Mar 2026) The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models arXiv

(17 Mar 2026) OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials arXiv

(17 Mar 2026) Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning arXiv

(16 Mar 2026) MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings arXiv

(16 Mar 2026) Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing arXiv

(16 Mar 2026) From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation arXiv

(15 Mar 2026) On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs arXiv

(14 Mar 2026) Improving Visual Reasoning with Iterative Evidence Refinement arXiv

(12 Mar 2026) Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation arXiv

(10 Mar 2026) C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis arXiv

(10 Mar 2026) MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning arXiv

(10 Mar 2026) GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision arXiv

(09 Mar 2026) MJ1: Multimodal Judgment via Grounded Verification arXiv

(08 Mar 2026) Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models arXiv

(08 Mar 2026) Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models arXiv

(07 Mar 2026) Perception-Aware Multimodal Spatial Reasoning from Monocular Images arXiv

(06 Mar 2026) MLLMRec-R1: Incentivizing Reasoning Capability in Large Language Models for Multimodal Sequential Recommendation arXiv

(06 Mar 2026) PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues arXiv

(06 Mar 2026) ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning arXiv

(05 Mar 2026) Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum arXiv

(04 Mar 2026) Phi-4-reasoning-vision-15B Technical Report arXiv

(04 Mar 2026) Discriminative Perception via Anchored Description for Reasoning Segmentation arXiv

(03 Mar 2026) TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval arXiv

(03 Mar 2026) SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety arXiv

(03 Mar 2026) Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling arXiv

(02 Mar 2026) Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine arXiv

(01 Mar 2026) Can Thinking Models Think to Detect Hateful Memes? arXiv

(01 Mar 2026) ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models arXiv

(01 Mar 2026) DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage arXiv

(01 Mar 2026) ICPRL: Acquiring Physical Intuition from Interactive Control arXiv

(28 Feb 2026) PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment arXiv

(27 Feb 2026) Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought arXiv

(27 Feb 2026) EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models arXiv

(27 Feb 2026) Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning arXiv

(27 Feb 2026) Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification arXiv

(27 Feb 2026) Reasoning-Driven Multimodal LLM for Domain Generalization arXiv

(26 Feb 2026) MediX-R1: Open Ended Medical Reinforcement Learning arXiv

(26 Feb 2026) FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning arXiv

(26 Feb 2026) CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays arXiv

(25 Feb 2026) MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving arXiv

(25 Feb 2026) See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs arXiv

(25 Feb 2026) RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning arXiv

(24 Feb 2026) HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning arXiv

(24 Feb 2026) Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking arXiv

(23 Feb 2026) TextShield-R1: Reinforced Reasoning for Tampered Text Detection arXiv

(23 Feb 2026) Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking arXiv

(19 Feb 2026) RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward arXiv

(19 Feb 2026) Enabling Training-Free Text-Based Remote Sensing Segmentation arXiv

(18 Feb 2026) Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning arXiv

(18 Feb 2026) Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI arXiv

(17 Feb 2026) Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families arXiv

(17 Feb 2026) On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks arXiv

(16 Feb 2026) Diagnosing Knowledge Conflict in Multimodal Long-Chain Reasoning arXiv

(16 Feb 2026) TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning arXiv

(16 Feb 2026) Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response arXiv

(15 Feb 2026) ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization arXiv

(15 Feb 2026) GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery arXiv

(14 Feb 2026) Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings arXiv

(14 Feb 2026) Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization arXiv

(13 Feb 2026) Reliable Thinking with Images arXiv

(13 Feb 2026) Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation arXiv

(12 Feb 2026) Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning arXiv

(11 Feb 2026) Canvas-of-Thought: Grounding Reasoning via Mutable Structured States arXiv

(10 Feb 2026) Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models arXiv

(10 Feb 2026) Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension arXiv

(10 Feb 2026) GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation arXiv

(09 Feb 2026) AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection arXiv

(09 Feb 2026) Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs arXiv

(09 Feb 2026) When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning arXiv

(07 Feb 2026) VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation arXiv

(07 Feb 2026) Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning arXiv

(06 Feb 2026) MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images arXiv

(06 Feb 2026) SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs arXiv

(05 Feb 2026) V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval arXiv

(05 Feb 2026) Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR arXiv

(05 Feb 2026) Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning arXiv

(04 Feb 2026) Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision arXiv

(04 Feb 2026) Training Data Efficiency in Multimodal Process Reward Models arXiv

(03 Feb 2026) Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance arXiv

(02 Feb 2026) Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling arXiv

(02 Feb 2026) VLM-Guided Experience Replay arXiv

(02 Feb 2026) ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning arXiv

(02 Feb 2026) Multimodal Large Language Models for Real-Time Situated Reasoning arXiv

(01 Feb 2026) StreamVLA: Breaking the Reason-Act Cycle via Completion-State Gating arXiv

(01 Feb 2026) Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis arXiv

(01 Feb 2026) SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning arXiv

(31 Jan 2026) RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding arXiv

(29 Jan 2026) MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods arXiv

(28 Jan 2026) MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models arXiv

(27 Jan 2026) Innovator-VL: A Multimodal Large Language Model for Scientific Discovery arXiv

(26 Jan 2026) AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning arXiv

(22 Jan 2026) Explainable Deepfake Detection with RL Enhanced Self-Blended Images arXiv

(20 Jan 2026) Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring arXiv

(19 Jan 2026) Think3D: Thinking with Space for Spatial Reasoning arXiv

(16 Jan 2026) ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models arXiv

(16 Jan 2026) MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement arXiv

(12 Jan 2026) CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation arXiv

(12 Jan 2026) MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning arXiv

(11 Jan 2026) E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis arXiv

(11 Jan 2026) Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy arXiv

(09 Jan 2026) SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More arXiv

(08 Jan 2026) Agri-R1: Agricultural Reasoning for Disease Diagnosis via Automated-Synthesis and Reinforcement Learning arXiv

(06 Jan 2026) ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing arXiv

(06 Jan 2026) Towards Faithful Reasoning in Comics for Small MLLMs arXiv

(06 Jan 2026) Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting arXiv

(01 Jan 2026) CPPO: Contrastive Perception for Vision Language Policy Optimization arXiv

(01 Jan 2026) From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning arXiv

(31 Dec 2025) From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme arXiv

(30 Dec 2025) SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning arXiv

(22 Dec 2025) Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection arXiv

(22 Dec 2025) CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning arXiv

(22 Dec 2025) Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation arXiv

(22 Dec 2025) SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models arXiv

(21 Dec 2025) ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning arXiv

(21 Dec 2025) Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback arXiv

(20 Dec 2025) Stable and Efficient Single-Rollout RL for Multimodal Reasoning arXiv

(19 Dec 2025) Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images arXiv

(19 Dec 2025) Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding arXiv

(18 Dec 2025) AdaTooler-V: Adaptive Tool-Use for Images and Videos arXiv

(16 Dec 2025) Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis arXiv

(16 Dec 2025) ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking arXiv

(16 Dec 2025) OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving arXiv

(15 Dec 2025) AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning arXiv

(13 Dec 2025) More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models arXiv

(13 Dec 2025) Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking arXiv

(12 Dec 2025) DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry arXiv

(09 Dec 2025) Thinking with Images via Self-Calling Agent arXiv

(08 Dec 2025) MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning arXiv

(07 Dec 2025) Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning arXiv

(07 Dec 2025) The Role of Entropy in Visual Grounding: Analysis and Optimization arXiv

(06 Dec 2025) ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models arXiv

(06 Dec 2025) VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning arXiv

(03 Dec 2025) TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning arXiv

(03 Dec 2025) Thinking with Programming Vision: Towards a Unified View for Thinking with Images arXiv

(03 Dec 2025) Multimodal Reinforcement Learning with Agentic Verifier for AI Agents arXiv

(03 Dec 2025) Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning arXiv

(02 Dec 2025) See, Think, Learn: A Self-Taught Multimodal Reasoner arXiv

(29 Nov 2025) ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning arXiv

(28 Nov 2025) TIM-PRM: Verifying multimodal reasoning with Tool-Integrated PRM arXiv

(27 Nov 2025) GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes arXiv

(26 Nov 2025) OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection arXiv

(25 Nov 2025) VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis arXiv

(24 Nov 2025) Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning arXiv

(23 Nov 2025) Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning arXiv

(22 Nov 2025) PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning arXiv

(21 Nov 2025) VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning arXiv

(21 Nov 2025) ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better arXiv

(20 Nov 2025) OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe arXiv

(19 Nov 2025) VisPlay: Self-Evolving Vision-Language Models from Images arXiv

(17 Nov 2025) Video Finetuning Improves Reasoning Between Frames arXiv

(17 Nov 2025) From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models arXiv

(17 Nov 2025) SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization arXiv

(14 Nov 2025) Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions arXiv

(13 Nov 2025) AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models arXiv

(12 Nov 2025) History-Aware Reasoning for GUI Agents arXiv

(11 Nov 2025) From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training arXiv

(10 Nov 2025) Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View arXiv

(07 Nov 2025) PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization arXiv

(04 Nov 2025) ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension arXiv

(04 Nov 2025) SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning arXiv

(01 Nov 2025) UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings arXiv

(01 Nov 2025) Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning arXiv

(31 Oct 2025) GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation arXiv

(31 Oct 2025) Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning arXiv

(23 Oct 2025) Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning arXiv

(23 Oct 2025) Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation arXiv

(18 Oct 2025) RL makes MLLMs see better than SFT arXiv

(16 Oct 2025) MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning arXiv

(15 Oct 2025) Generative Universal Verifier as Multimodal Meta-Reasoner arXiv

(14 Oct 2025) HoneyBee: Data Recipes for Vision-Language Reasoners arXiv

(14 Oct 2025) DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search arXiv

(10 Oct 2025) Unleashing Perception-Time Scaling to Multimodal Reasoning Models arXiv

(10 Oct 2025) Spotlight on Token Perception for Multimodal Reinforcement Learning arXiv

(10 Oct 2025) Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging arXiv

(13 Oct 2025) CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images arXiv

(9 Oct 2025) ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping arXiv

(9 Oct 2025) SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models arXiv

(7 Oct 2025) Context Matters: Learning Global Semantics via Object-Centric Representation arXiv

(6 Oct 2025) Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment arXiv

(3 Oct 2025) Efficient Test-Time Scaling for Small Vision-Language Models arXiv

(27 Sep 2025) Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning arXiv

(29 Sep 2025) Latent Visual Reasoning arXiv

(29 Sep 2025) GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning arXiv

(28 Sep 2025) Poivre: Self-Refining Visual Pointing with Reinforcement Learning arXiv

(29 Sep 2025) VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding arXiv

(29 Sep 2025) Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks arXiv

(25 Sep 2025) MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources arXiv

(12 Sep 2025) LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA arXiv

(9 Sep 2025) Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search arXiv

(28 Aug 2025) R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning arXiv

(27 Aug 2025) Self-Rewarding Vision-Language Model via Reasoning Decomposition arXiv

(18 Aug 2025) M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following arXiv

(18 Aug 2025) Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation arXiv

(18 Aug 2025) Ovis2.5 Technical Report arXiv

(18 Aug 2025) MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models arXiv

(8 Aug 2025) SIFThinker: Spatially-Aware Image Focus for Visual Reasoning arXiv

(7 Aug 2025) Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision arXiv

(7 Aug 2025) StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models arXiv

(5 Aug 2025) Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions arXiv

(30 Jul 2025) MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention arXiv

(28 Jul 2025) Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback arXiv

(24 Jul 2025) MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning arXiv

(24 Jul 2025) SafeWork-R1: Coevolving Safety and Intelligence under the AI-45 Law arXiv

(22 Jul 2025) C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning arXiv

(22 Jul 2025) Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning arXiv

(11 Jul 2025) M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning arXiv

(3 Jul 2025) Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation arXiv

(1 Jul 2025) GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning arXiv

(20 Jun 2025) GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning arXiv

(16 Jun 2025) Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning arXiv

(11 Jun 2025) ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs arXiv

(5 Jun 2025) Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning arXiv

(5 Jun 2025) Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos arXiv

(5 Jun 2025) MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning arXiv

(16 May 2025) Visual Planning: Let's Think Only with Images arXiv

(15 May 2025) MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning arXiv

(13 May 2025) OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning arXiv

(12 May 2025) Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning arXiv

(8 May 2025) Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging arXiv

(8 May 2025) SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models arXiv

(6 May 2025) X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains arXiv

(6 May 2025) Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning arXiv

(6 May 2025) ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant arXiv

(5 May 2025) R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning arXiv

(28 Apr 2025) SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning arXiv

(25 Apr 2025) Fast-Slow Thinking for Large Vision-Language Model Reasoning arXiv

(25 Apr 2025) Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization arXiv

(25 Apr 2025) Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning arXiv

(21 Apr 2025) A Call for New Recipes to Enhance Spatial Reasoning in MLLMs arXiv

(20 Apr 2025) Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension arXiv

(12 Apr 2025) VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search arXiv

(10 Apr 2025) VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model arXiv

(10 Apr 2025) SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement arXiv

(10 Apr 2025) Perception-R1: Pioneering Perception Policy with Reinforcement Learning arXiv

(10 Apr 2025) Kimi-VL Technical Report arXiv

(8 Apr 2025) On the Suitability of Reinforcement Fine-Tuning to Visual Tasks arXiv

(8 Apr 2025) Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought arXiv

(1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training arXiv

(17 Mar 2025) R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization arXiv

(13 Mar 2025) VisualPRM: An Effective Process Reward Model for Multimodal Reasoning arXiv

(9 Mar 2025) Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models arXiv

(7 Mar 2025) R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning arXiv

(7 Mar 2025) Unified Reward Model for Multimodal Understanding and Generation arXiv

(7 Mar 2025) R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model arXiv

(3 Mar 2025) Visual-RFT: Visual Reinforcement Fine-Tuning arXiv

(4 Feb 2025) Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking arXiv

(3 Jan 2025) Virgo: A Preliminary Exploration on Reproducing o1-like MLLM arXiv

(13 Jan 2025) Imagine while Reasoning in Space: Multimodal Visualization-of-Thought arXiv

(10 Jan 2025) LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs arXiv

(9 Jan 2025) Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark arXiv

(30 Dec 2024) Slow Perception: Let's Perceive Geometric Figures Step-by-step arXiv

(19 Dec 2024) Progressive Multimodal Reasoning via Active Retrieval arXiv

(29 Nov 2024) Interleaved-Modal Chain-of-Thought arXiv

(15 Nov 2024) Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination arXiv

(15 Nov 2024) LLaVA-CoT: Let Vision Language Models Reason Step-by-Step arXiv

(30 Oct 2024) Vision-Language Models Can Self-Improve Reasoning via Reflection arXiv

(23 Oct 2024) R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(11 Oct 2024) M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought arXiv

(6 Oct 2024) MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration arXiv

(4 Oct 2024) Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning arXiv

(29 Sep 2024) CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought arXiv

(13 Jun 2024) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models arXiv

(28 Dec 2023) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos arXiv

(14 Dec 2023) Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models arXiv

(27 Nov 2023) Compositional Chain-of-Thought Prompting for Large Multimodal Models arXiv

(15 Nov 2023) The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task arXiv

(3 May 2023) Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings arXiv

(16 Apr 2023) Chain of Thought Prompt Tuning in Vision Language Models arXiv

(2 Feb 2023) Multimodal Chain-of-Thought Reasoning in Language Models arXiv

Video

(13 Apr 2026) Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding arXiv

(30 Mar 2026) SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning arXiv

(28 Mar 2026) Incentivizing Temporal-Awareness in Egocentric Video Understanding Models arXiv

(27 Mar 2026) Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning arXiv

(26 Mar 2026) VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning arXiv

(26 Mar 2026) Reinforcing Structured Chain-of-Thought for Video Understanding arXiv

(24 Mar 2026) EVA: Efficient Reinforcement Learning for End-to-End Video Agent arXiv

(17 Mar 2026) When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition arXiv

(12 Mar 2026) Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models arXiv

(19 Feb 2026) GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking arXiv

(12 Feb 2026) STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning arXiv

(30 Jan 2026) Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning arXiv

(28 Jan 2026) Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning arXiv

(27 Jan 2026) Video-KTR: Reinforcing Video Reasoning via Key Token Attribution arXiv

(08 Jan 2026) VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice arXiv

(07 Dec 2025) MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning arXiv

(02 Dec 2025) OneThinker: All-in-one Reasoning Model for Image and Video arXiv

(28 Nov 2025) Video-CoM: Interactive Video Reasoning via Chain of Manipulations arXiv

(24 Nov 2025) VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning arXiv

(17 Nov 2025) ViSS-R1: Self-Supervised Reinforcement Video Reasoning arXiv

(17 Nov 2025) DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning arXiv

(23 Oct 2025) Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence arXiv

(9 Oct 2025) SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models arXiv

(6 Oct 2025) Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models arXiv

(5 Oct 2025) Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning arXiv

(29 Sep 2025) FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting arXiv

(29 Sep 2025) LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning arXiv

(28 Sep 2025) FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning arXiv

(12 Jun 2025) CogStream: Context-guided Streaming Video Question Answering arXiv

(6 Jun 2025) VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning arXiv

(27 Mar 2025) Video-R1: Reinforcing Video Reasoning in MLLMs arXiv

(17 Feb 2025) video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model arXiv

(10 Feb 2025) CoS: Chain-of-Shot Prompting for Long Video Understanding arXiv

(8 Jan 2025) Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs arXiv

(3 Dec 2024) VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation arXiv

(2 Dec 2024) Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation arXiv

(29 Nov 2024) STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training arXiv

(21 Oct 2024) Improve Vision Language Model Chain-of-thought Reasoning arXiv

(12 Oct 2024) Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning arXiv

(27 Sep 2024) Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks arXiv

(28 Aug 2024) Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation arXiv

(24 May 2024) Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models arXiv

(7 May 2024) Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition arXiv code

(8 Oct 2024) Temporal Reasoning Transfer from Text to Video arXiv

DLLM

(07 Apr 2026) Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models arXiv

(23 Mar 2026) Tiny Inference-Time Scaling with Latent Verifiers arXiv

(12 Mar 2026) EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models arXiv

(06 Mar 2026) Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion arXiv

(12 Feb 2026) Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation arXiv

(31 Jan 2026) Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings arXiv

(29 Dec 2025) ThinkGen: Generalized Thinking for Visual Generation arXiv

(25 Dec 2025) Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration arXiv

(03 Dec 2025) ReasonX: MLLM-Guided Intrinsic Image Decomposition arXiv

(27 Nov 2025) ReasonEdit: Towards Reasoning-Enhanced Image Editing Models arXiv

(9 Oct 2025) Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization arXiv

(9 Oct 2025) Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization arXiv

Audio

(13 Apr 2026) Empowering Video Translation using Multimodal Large Language Models arXiv

(05 Mar 2026) SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning arXiv

(25 Jan 2026) AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation arXiv

(23 Oct 2025) Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards arXiv

(10 Oct 2025) Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models arXiv

(8 Oct 2025) Can Speech LLMs Think while Listening? arXiv

(5 Oct 2025) Principled and Tractable RL for Reasoning with Diffusion Language Models arXiv

(22 Jul 2025) Step-Audio 2 Technical Report arXiv

(14 Mar 2025) Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering arXiv

Image/Video Generation

(15 Apr 2026) Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning arXiv

(14 Apr 2026) Representation geometry shapes task performance in vision-language modeling for CT enterography arXiv

(14 Apr 2026) PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning arXiv

(10 Mar 2026) Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization arXiv

(04 Mar 2026) Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks arXiv

(28 Jan 2026) Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought arXiv

(26 Jan 2026) GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning arXiv

(29 Dec 2025) REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation arXiv

(23 Dec 2025) CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation arXiv

(14 Dec 2025) Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space arXiv

(04 Dec 2025) DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation arXiv

(18 Nov 2025) UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning arXiv

(13 Nov 2025) Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance arXiv

(24 Oct 2025) Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation arXiv

(15 Oct 2025) Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation arXiv

(9 Oct 2025) Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing arXiv

(9 Oct 2025) Reinforcing Diffusion Models by Direct Group Preference Optimization arXiv

(9 Oct 2025) Real-Time Motion-Controllable Autoregressive Video Diffusion arXiv

(29 Sep 2025) STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation arXiv

(28 Aug 2025) Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance arXiv

(28 Aug 2025) OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning arXiv

(28 Aug 2025) Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning arXiv

(27 Aug 2025) CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning arXiv

(9 Aug 2025) AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning arXiv

(28 Jul 2025) Multimodal LLMs as Customized Reward Models for Text-to-Image Generation arXiv

(20 Jun 2025) RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought arXiv

(17 Jun 2025) SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks arXiv

(16 May 2025) Towards Self-Improvement of Diffusion Models via Group Preference Optimization arXiv

(16 May 2025) Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models arXiv

(15 May 2025) Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models arXiv

(12 May 2025) DanceGRPO: Unleashing GRPO on Visual Generation arXiv

(8 May 2025) Flow-GRPO: Training Flow Matching Models via Online RL arXiv

(1 May 2025) T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT arXiv

(22 Apr 2025) From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning arXiv

(22 Apr 2025) Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning arXiv

(26 Mar 2025) MMGen: Unified Multi-modal Image Generation and Understanding in One Go arXiv

(13 Mar 2025) GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing arXiv

(3 Mar 2025) MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation arXiv

(23 Jan 2025) Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step arXiv

Bench/Dataset

(15 Apr 2026) Reward Design for Physical Reasoning in Vision-Language Models arXiv

(15 Apr 2026) Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges arXiv

(14 Apr 2026) Visual Preference Optimization with Rubric Rewards arXiv

(13 Apr 2026) Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning arXiv

(13 Apr 2026) Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding arXiv

(09 Apr 2026) Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization arXiv

(08 Apr 2026) LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment arXiv

(06 Apr 2026) Rethinking Model Efficiency: Multi-Agent Inference with Large Models arXiv

(05 Apr 2026) GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces arXiv

(05 Apr 2026) Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges arXiv

(04 Apr 2026) FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning arXiv

(31 Mar 2026) Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models arXiv

(29 Mar 2026) Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning arXiv

(28 Mar 2026) Inference-Time Structural Reasoning for Compositional Vision-Language Understanding arXiv

(25 Mar 2026) How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning arXiv

(25 Mar 2026) NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders arXiv

(23 Mar 2026) Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages arXiv

(23 Mar 2026) Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs arXiv

(13 Mar 2026) Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World arXiv

(13 Mar 2026) Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation arXiv

(12 Mar 2026) MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning arXiv

(10 Mar 2026) Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs arXiv

(10 Mar 2026) OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks arXiv

(10 Mar 2026) EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv

(07 Mar 2026) Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards arXiv

(06 Mar 2026) CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning arXiv

(06 Mar 2026) TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis arXiv

(05 Mar 2026) Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs arXiv

(03 Mar 2026) Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning arXiv

(01 Mar 2026) When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains arXiv

(28 Feb 2026) ReMoT: Reinforcement Learning with Motion Contrast Triplets arXiv

(27 Feb 2026) Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees arXiv

(27 Feb 2026) PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning arXiv

(26 Feb 2026) Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning arXiv

(25 Feb 2026) When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning arXiv

(25 Feb 2026) PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning arXiv

(24 Feb 2026) From Perception to Action: An Interactive Benchmark for Vision Reasoning arXiv

(18 Feb 2026) DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning arXiv

(15 Feb 2026) Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding arXiv

(13 Feb 2026) On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs arXiv

(13 Feb 2026) MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs arXiv

(12 Feb 2026) What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis arXiv

(11 Feb 2026) TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning arXiv

(11 Feb 2026) MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning arXiv

(08 Feb 2026) SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models arXiv

(05 Feb 2026) M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning arXiv

(05 Feb 2026) OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention arXiv

(04 Feb 2026) Reinforced Attention Learning arXiv

(01 Feb 2026) Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning arXiv

(29 Jan 2026) SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding arXiv

(21 Jan 2026) Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing arXiv

(19 Jan 2026) CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning arXiv

(15 Jan 2026) Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge arXiv

(13 Jan 2026) M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding arXiv

(02 Jan 2026) RoboReward: General-Purpose Vision-Language Reward Models for Robotics arXiv

(19 Dec 2025) FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis arXiv

(16 Dec 2025) TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs arXiv

(15 Dec 2025) MMhops-R1: Multimodal Multi-hop Reasoning arXiv

(11 Dec 2025) Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules arXiv

(11 Dec 2025) Limits and Gains of Test-Time Scaling in Vision-Language Reasoning arXiv

(10 Dec 2025) Rethinking Chain-of-Thought Reasoning for Videos arXiv

(09 Dec 2025) MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models arXiv

(03 Dec 2025) Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs arXiv

(28 Nov 2025) AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture arXiv

(28 Nov 2025) Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models arXiv

(26 Nov 2025) Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models arXiv

(25 Nov 2025) Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement arXiv

(24 Nov 2025) CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization arXiv

(21 Nov 2025) MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models arXiv

(19 Nov 2025) Trustworthy and Fair SkinGPT-R1 for Democratizing Dermatological Reasoning across Diverse Ethnicities arXiv

(13 Nov 2025) Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling arXiv

(12 Nov 2025) Simple Vision-Language Math Reasoning via Rendered Text arXiv

(09 Nov 2025) SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports arXiv

(07 Nov 2025) Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale arXiv

(03 Nov 2025) TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning arXiv

(30 Oct 2025) ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning arXiv

(15 Oct 2025) Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models arXiv

(14 Oct 2025) Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning arXiv

(10 Oct 2025) BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception arXiv

(10 Oct 2025) SpaceVista: All-Scale Visual Spatial Reasoning from mm to km arXiv

(9 Sep 2025) Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images arXiv

(27 Aug 2025) 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis arXiv

(8 Aug 2025) MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models arXiv

(8 Aug 2025) InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic? arXiv

(22 Jul 2025) ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering arXiv

(22 Jul 2025) Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning arXiv

(12 Jun 2025) VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos arXiv

(12 Jun 2025) MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning arXiv

(6 Jun 2025) PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts arXiv

(5 Jun 2025) VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos arXiv

(5 Jun 2025) MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark arXiv

(15 May 2025) StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation arXiv

(13 May 2025) VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models arXiv

(1 May 2025) MINERVA: Evaluating Complex Video Reasoning arXiv

(30 Apr 2025) GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling arXiv

(21 Apr 2025) IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs arXiv

(21 Apr 2025) VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models arXiv

(17 Apr 2025) Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark arXiv

(16 Apr 2025) FLIP Reasoning Challenge arXiv

(14 Apr 2025) VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge arXiv

(8 Apr 2025) ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering arXiv

(8 Apr 2025) V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models arXiv

(8 Apr 2025) MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models arXiv

(4 Apr 2025) Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme arXiv

(15 Feb 2025) SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding arXiv

(14 Feb 2025) MM-RLHF: The Next Step Forward in Multimodal LLM Alignment arXiv

(13 Feb 2025) MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency arXiv

(18 Dec 2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces arXiv

(22 Nov 2024) VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection arXiv code

(18 Oct 2024) MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps arXiv

(7 Jul 2024) VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool arXiv

(20 Jun 2024) MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding arXiv

(12 Jun 2024) LVBench: An Extreme Long Video Understanding Benchmark arXiv

(24 Apr 2024) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM arXiv

(16 Apr 2024) OpenEQA: Embodied Question Answering in the Era of Foundation Models arXiv

(17 Aug 2023) EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding arXiv

(23 May 2023) Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought arXiv

(18 May 2021) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions arXiv

Latent

(09 Apr 2026) Multimodal Latent Reasoning via Predictive Embeddings arXiv

(02 Apr 2026) PLUME: Latent Reasoning Based Universal Multimodal Embedding arXiv

(26 Mar 2026) LanteRn: Latent Visual Structured Reasoning arXiv

(23 Mar 2026) Q-Tacit: Image Quality Assessment via Latent Visual Reasoning arXiv

(24 Feb 2026) CrystaL: Spontaneous Emergence of Visual Latents in MLLMs arXiv

(05 Feb 2026) Multimodal Latent Reasoning via Hierarchical Visual Cues Injection arXiv

(04 Feb 2026) Vision-aligned Latent Reasoning for Multi-modal Large Language Model arXiv

(28 Dec 2025) ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving arXiv

(26 Nov 2025) Monet: Reasoning in Latent Visual Space Beyond Images and Language arXiv

(22 Nov 2025) L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention arXiv

(04 Nov 2025) Multimodal Reasoning via Latent Refocusing arXiv

(29 Sep 2025) Latent Visual Reasoning arXiv

(12 Feb 2025) Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning arXiv

(7 Feb 2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach arXiv

(9 Dec 2024) Training Large Language Models to Reason in a Continuous Latent Space arXiv

Open Source Project

https://github.com/Hui-design/Open-LLaVA-Video-R1

https://github.com/SkyworkAI/Skywork-R1V

https://huggingface.co/papers/2503.05379

https://github.com/Osilly/Vision-R1

https://github.com/ModalMinds/MM-EUREKA

https://github.com/OpenRLHF/OpenRLHF-M

https://github.com/Fancy-MLLM/R1-Onevision

https://github.com/om-ai-lab/VLM-R1

https://github.com/EvolvingLMMs-Lab/open-r1-multimodal

https://github.com/Deep-Agent/R1-V

https://github.com/TideDra/lmm-r1

https://github.com/tulerfeng/Video-R1

https://github.com/Wang-Xiaodong1899/Open-R1-Video