Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
Main Authors:
Format: Journal Article
Language: English
Published: 24-05-2024
Summary: Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts (those that require simple descriptions of visual elements) but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and to effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2% to 33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
DOI: 10.48550/arxiv.2405.15683
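The summary describes VDGD's decoding rule only at a high level: candidate tokens are scored by their KL divergence to the generated image description, and lower-divergence candidates are favored. The sketch below is one plausible minimal reading of that rule, not the paper's implementation; `vdgd_step`, the top-k candidate set, the averaged per-position description distribution `q`, and the additive score weighting are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of a VDGD-style decoding step. The exact scoring
# rule and distribution definitions are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def vdgd_step(next_token_logits: torch.Tensor,
              desc_token_dists: torch.Tensor,
              top_k: int = 10) -> int:
    """Pick the next token from the top-k candidates, favoring the one
    whose divergence to the description distribution is lowest.

    next_token_logits: (vocab,) logits for the next position, conditioned
        on the instruction with the image description prepended.
    desc_token_dists: (desc_len, vocab) model output distributions at the
        positions of the generated description (assumed to be cached).
    """
    probs = F.softmax(next_token_logits, dim=-1)
    top_probs, top_ids = probs.topk(top_k)

    # Reference distribution summarizing the description (an assumption:
    # average the per-position distributions over description tokens).
    q = desc_token_dists.mean(dim=0)

    # For a one-hot candidate t, KL(delta_t || q) reduces to -log q[t];
    # lower divergence = candidate better supported by the description.
    kl_to_desc = -torch.log(q[top_ids] + 1e-12)

    # Re-rank: combine the model's own confidence with the grounding term
    # (equal weighting is a free design choice in this sketch).
    scores = torch.log(top_probs) - kl_to_desc
    return top_ids[scores.argmax()].item()

if __name__ == "__main__":
    vocab = 32000
    logits = torch.randn(vocab)                       # dummy next-token logits
    desc = F.softmax(torch.randn(24, vocab), dim=-1)  # 24 description positions
    print(vdgd_step(logits, desc))
```

Because KL(delta_t || q) collapses to -log q[t] for a one-hot candidate, "lower divergence" under these assumptions simply prefers tokens the description distribution supports; the paper may define both distributions differently.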