Search Results - "Seo, Paul Hongsuck"
-
1
MarioQA: Answering Questions by Watching Gameplay Videos
Published in 2017 IEEE International Conference on Computer Vision (ICCV) (01-10-2017)“…We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed…”
Conference Proceeding -
2
Look Before you Speak: Visually Contextualized Utterances
Published in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2021)“…While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic…”
Conference Proceeding -
3
End-to-end Generative Pretraining for Multimodal Video Captioning
Published in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2022)“…Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new…”
Conference Proceeding -
4
Learning for Single-Shot Confidence Calibration in Deep Neural Networks Through Stochastic Inferences
Published in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2019)“…We propose a generic framework to calibrate accuracy and confidence of a prediction in deep neural networks through stochastic inferences. We interpret…”
Conference Proceeding -
5
Zero-shot Referring Image Segmentation with Global-Local Context Features
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2023)“…Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled…”
Conference Proceeding -
6
Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction
Published in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2016)“…We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are…”
Conference Proceeding -
7
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2023)“…In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale…”
Conference Proceeding -
8
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2023)“…Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training…”
Conference Proceeding -
9
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (01-06-2023)“…Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer)…”
Conference Proceeding -
10
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Published 10-07-2024“…We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring…”
Journal Article -
11
Zero-shot Referring Image Segmentation with Global-Local Context Features
Published 31-03-2023“…Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled…”
Journal Article -
12
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Published 29-03-2023“…Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training…”
Journal Article -
13
AVATAR submission to the Ego4D AV Transcription Challenge
Published 17-11-2022“…In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state of the…”
Journal Article -
14
Learning Correlation Structures for Vision Transformers
Published in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (16-06-2024)“…We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query…”
Conference Proceeding -
15
Learning Correlation Structures for Vision Transformers
Published 05-04-2024“…We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query…”
Journal Article -
16
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
Published 25-03-2023“…Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer)…”
Journal Article -
17
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
Published 29-09-2024“…Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what…”
Journal Article -
18
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
Published in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (16-06-2024)“…Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work,…”
Conference Proceeding -
19
Look Before you Speak: Visually Contextualized Utterances
Published 10-12-2020“…While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic…”
Journal Article -
20
End-to-end Generative Pretraining for Multimodal Video Captioning
Published 20-01-2022“…Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) 2022. Recent video and language pretraining frameworks lack the ability to generate…”
Journal Article