Search Results - "Seo, Paul Hongsuck"

  1.

    MarioQA: Answering Questions by Watching Gameplay Videos by Mun, Jonghwan, Seo, Paul Hongsuck, Jung, Ilchae, Han, Bohyung

    “…We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed…”
    Conference Proceeding
  2.

    Look Before you Speak: Visually Contextualized Utterances by Seo, Paul Hongsuck, Nagrani, Arsha, Schmid, Cordelia

    “…While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic…”
    Conference Proceeding
  3.

    End-to-end Generative Pretraining for Multimodal Video Captioning by Seo, Paul Hongsuck, Nagrani, Arsha, Arnab, Anurag, Schmid, Cordelia

    “…Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new…”
    Conference Proceeding
  4.

    Learning for Single-Shot Confidence Calibration in Deep Neural Networks Through Stochastic Inferences by Seo, Seonguk, Seo, Paul Hongsuck, Han, Bohyung

    “…We propose a generic framework to calibrate accuracy and confidence of a prediction in deep neural networks through stochastic inferences. We interpret…”
    Conference Proceeding
  5.

    Zero-shot Referring Image Segmentation with Global-Local Context Features by Yu, Seonghoon, Seo, Paul Hongsuck, Son, Jeany

    “…Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled…”
    Conference Proceeding
  6.

    Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction by Noh, Hyeonwoo, Seo, Paul Hongsuck, Han, Bohyung

    “…We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are…”
    Conference Proceeding
  7.

    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning by Yang, Antoine, Nagrani, Arsha, Seo, Paul Hongsuck, Miech, Antoine, Pont-Tuset, Jordi, Laptev, Ivan, Sivic, Josef, Schmid, Cordelia

    “…In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale…”
    Conference Proceeding
  8.

    AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR by Seo, Paul Hongsuck, Nagrani, Arsha, Schmid, Cordelia

    “…Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training…”
    Conference Proceeding
  9.

    IFSeg: Image-free Semantic Segmentation via Vision-Language Model by Yun, Sukmin, Park, Seong Hyeon, Seo, Paul Hongsuck, Shin, Jinwoo

    “…Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer)…”
    Conference Proceeding
  10.

    Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation by Yu, Seonghoon, Seo, Paul Hongsuck, Son, Jeany

    Published 10-07-2024
    “…We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring…”
    Journal Article
  11.

    Zero-shot Referring Image Segmentation with Global-Local Context Features by Yu, Seonghoon, Seo, Paul Hongsuck, Son, Jeany

    Published 31-03-2023
    “…Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled…”
    Journal Article
  12.

    AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR by Seo, Paul Hongsuck, Nagrani, Arsha, Schmid, Cordelia

    Published 29-03-2023
    “…Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training…”
    Journal Article
  13.

    AVATAR submission to the Ego4D AV Transcription Challenge by Seo, Paul Hongsuck, Nagrani, Arsha, Schmid, Cordelia

    Published 17-11-2022
    “…In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state of the…”
    Journal Article
  14.

    Learning Correlation Structures for Vision Transformers by Kim, Manjin, Seo, Paul Hongsuck, Schmid, Cordelia, Cho, Minsu

    “…We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query…”
    Conference Proceeding
  15.

    Learning Correlation Structures for Vision Transformers by Kim, Manjin, Seo, Paul Hongsuck, Schmid, Cordelia, Cho, Minsu

    Published 05-04-2024
    “…We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query…”
    Journal Article
  16.

    IFSeg: Image-free Semantic Segmentation via Vision-Language Model by Yun, Sukmin, Park, Seong Hyeon, Seo, Paul Hongsuck, Shin, Jinwoo

    Published 25-03-2023
    “…Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer)…”
    Journal Article
  17.

    Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels by Shin, Heeseong, Kim, Chaehyun, Hong, Sunghwan, Cho, Seokju, Arnab, Anurag, Seo, Paul Hongsuck, Kim, Seungryong

    Published 29-09-2024
    “…Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what…”
    Journal Article
  18.

    CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation by Cho, Seokju, Shin, Heeseong, Hong, Sunghwan, Arnab, Anurag, Seo, Paul Hongsuck, Kim, Seungryong

    “…Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work,…”
    Conference Proceeding
  19.

    Look Before you Speak: Visually Contextualized Utterances by Seo, Paul Hongsuck, Nagrani, Arsha, Schmid, Cordelia

    Published 10-12-2020
    “…While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic…”
    Journal Article
  20.

    End-to-end Generative Pretraining for Multimodal Video Captioning by Seo, Paul Hongsuck, Nagrani, Arsha, Arnab, Anurag, Schmid, Cordelia

    Published 20-01-2022
    Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) 2022
    “…Recent video and language pretraining frameworks lack the ability to generate…”
    Journal Article