HomE: Homography-Equivariant Video Representation Learning
Main Authors:
Format: Journal Article
Language: English
Published: 02-06-2023
Subjects:
Online Access: Get full text
Summary: Recent advances in self-supervised representation learning have enabled more efficient and robust model performance without relying on extensive labeled data. However, most works are still focused on images, with few working on videos and even fewer on multi-view videos, where more powerful inductive biases can be leveraged for self-supervision. In this work, we propose a novel method for representation learning of multi-view videos, where we explicitly model the representation space to maintain Homography Equivariance (HomE). Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views. We evaluate our HomE representation via action recognition and pedestrian intent prediction as downstream tasks. On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods. Similarly, on the STIP dataset, we outperform the state-of-the-art by 6% for pedestrian intent prediction one second into the future while also obtaining an accuracy of 91.2% for pedestrian action (cross vs. not-cross) classification. Code is available at https://github.com/anirudhs123/HomE.
DOI: 10.48550/arxiv.2306.01623
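
The summary describes learning a representation space in which features of neighboring views are related by the known homography between them. Purely as an illustration of that idea, the minimal PyTorch sketch below warps the feature map of one view with the inter-view homography and penalizes its distance to the other view's features. The encoder, feature-map warping, and MSE penalty here are assumptions made for illustration, not the authors' implementation; the official code is at the GitHub link in the summary.

```python
# Illustrative homography-equivariance objective in the spirit of HomE.
# Assumptions (not from the paper): a small conv encoder, bilinear warping of
# feature maps in normalized coordinates, and an MSE consistency loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_with_homography(feat, H):
    """Warp a feature map (B, C, h, w) with a 3x3 homography H (B, 3, 3)
    expressed in normalized [-1, 1] image coordinates."""
    B, C, h, w = feat.shape
    # Build a normalized pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).view(1, h * w, 3)  # (1, hw, 3)
    warped = grid @ H.transpose(1, 2)                             # (B, hw, 3)
    warped = warped[..., :2] / warped[..., 2:].clamp(min=1e-6)    # dehomogenize
    return F.grid_sample(feat, warped.view(B, h, w, 2), align_corners=True)


def home_equivariance_loss(encoder, view_a, view_b, H_ab):
    """Encourage encoder(view_b) to match the homography-warped encoder(view_a)."""
    feat_a = encoder(view_a)   # (B, C, h, w)
    feat_b = encoder(view_b)   # (B, C, h, w)
    feat_a_warped = warp_with_homography(feat_a, H_ab)
    return F.mse_loss(feat_a_warped, feat_b)


if __name__ == "__main__":
    # Toy usage: identical "views" related by the identity homography,
    # so the loss should be near zero.
    encoder = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1),
    )
    view_a = torch.randn(2, 3, 32, 32)
    view_b = view_a.clone()
    H_ab = torch.eye(3).expand(2, 3, 3)
    loss = home_equivariance_loss(encoder, view_a, view_b, H_ab)
    print(f"equivariance loss: {loss.item():.6f}")
```

In this sketch the homography acts on the feature maps of one view before comparison, so a representation that minimizes the loss must transform consistently with the geometric relationship between neighboring cameras; the hypothetical `warp_with_homography` helper simply implements that warping via `grid_sample`.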