Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
Format: Journal Article
Language: English
Published: 04-04-2024
Summary: Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit the spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) on these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. The dataset and code are available at https://microsoft.github.io/visualization-of-thought
DOI: 10.48550/arxiv.2404.03622
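The summary describes VoT only at a high level: the prompt asks the model to render the intermediate spatial state as text after each reasoning step. The sketch below shows what such a prompt might look like for the visual-navigation setting; the instruction wording, grid symbols, and example map are illustrative assumptions, not the paper's actual prompt templates, which can be found at the project page linked above.

```python
# Minimal, speculative sketch of a Visualization-of-Thought (VoT) style
# prompt for a 2D grid-navigation task. The wording and the example grid
# are assumptions made for illustration; the official dataset and prompts
# live at https://microsoft.github.io/visualization-of-thought

def build_vot_prompt(grid: str, task: str) -> str:
    """Assemble a prompt that asks the model to redraw the grid state
    after every reasoning step, so each step is 'visualized'."""
    instructions = (
        "Solve the navigation task below. After EACH step of your "
        "reasoning, visualize the current state of the grid as text "
        "before deciding on the next move."
    )
    return f"{instructions}\n\nGrid:\n{grid}\n\nTask: {task}\n"

if __name__ == "__main__":
    # Hypothetical 3x3 world: S = start, G = goal, # = wall, . = open cell.
    grid = "S . .\n# # .\nG . ."
    task = "Give a sequence of moves (up/down/left/right) from S to G."
    print(build_vot_prompt(grid, task))  # Send to any chat LLM client.
```

The key difference from plain chain-of-thought prompting is the explicit instruction to emit a textual rendering of the intermediate spatial state, which, per the abstract, is what guides the model's subsequent reasoning steps.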