Learning Joint Representation of Human Motion and Language

Bibliographic Details
Main Authors: Kim, Jihoon, Yu, Youngjae, Shin, Seungyoun, Byun, Taehyun, Choi, Sungjoon
Format: Journal Article
Language: English
Published: 27-10-2022
Description
Summary: In this work, we present MoLang (a Motion-Language connecting model) for learning a joint representation of human motion and language, leveraging both unpaired and paired datasets of the motion and language modalities. To this end, we propose a motion-language model with contrastive learning, empowering our model to learn more generalizable representations of the human motion domain. Empirical results show that our model learns strong representations of human motion data by grounding it in the language modality. Our proposed method performs both action recognition and motion retrieval with a single model, outperforming state-of-the-art approaches on a number of action recognition benchmarks.
DOI: 10.48550/arxiv.2210.15187
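
The summary names contrastive learning as the objective that ties the motion and language modalities together, but the record gives no implementation details. As a minimal illustrative sketch only, the snippet below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired motion and text embeddings in PyTorch; the function name, embedding size, and temperature value are hypothetical assumptions, not MoLang's actual code.

```python
import torch
import torch.nn.functional as F

def motion_text_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired (motion, text) embeddings.

    motion_emb, text_emb: (batch, dim) tensors where row i of each
    tensor forms a positive pair; all other rows act as negatives.
    NOTE: illustrative sketch of a standard contrastive objective,
    not the loss used in the MoLang paper.
    """
    # L2-normalize so the dot products below are cosine similarities.
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positives.
    logits = motion_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: motion->text and text->motion.
    loss_m2t = F.cross_entropy(logits, targets)
    loss_t2m = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_m2t + loss_t2m)

# Toy usage with random embeddings (batch and dimension are arbitrary):
motion = torch.randn(8, 256)
text = torch.randn(8, 256)
print(motion_text_contrastive_loss(motion, text))
```

Training such an objective on paired data is also what enables a single model to serve both tasks mentioned in the summary: nearest-neighbor search in the shared embedding space gives motion retrieval, while comparing a motion embedding against embedded label texts gives action recognition.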