MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
Main Authors: | Shen, Leyang; Chen, Gongwei; Shao, Rui; Guan, Weili; Nie, Liqiang |
Format: | Journal Article |
Language: | English |
Published: | 17-07-2024 |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online Access: | https://arxiv.org/abs/2407.12709 |
Abstract | Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components, a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders, and has strong compatibility with various transformation architectures. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME |
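The two components named in the abstract can be illustrated with a short sketch. The code below is not taken from the MoME paper or its repository; it is a minimal, hypothetical PyTorch rendition of the two ideas described above: a learned gate that adaptively weights and fuses features from several vision encoders (the MoVE idea), and sparsely gated feed-forward experts with top-1 routing so that per-token inference cost stays close to a single FFN (the MoLE idea). All class names, dimensions, and the exact gating scheme are illustrative assumptions.

```python
# Illustrative sketch only; module names, shapes, and routing choices are
# assumptions for demonstration, not the MoME implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfVisionExperts(nn.Module):
    """Adaptively weights features coming from several vision encoders.

    Assumes, for simplicity, that every encoder yields the same token count
    and feature dimension.
    """

    def __init__(self, num_encoders: int, feat_dim: int):
        super().__init__()
        # A single learned gate produces one weight per encoder from pooled features.
        self.gate = nn.Linear(num_encoders * feat_dim, num_encoders)

    def forward(self, encoder_feats: list[torch.Tensor]) -> torch.Tensor:
        # encoder_feats: list of [batch, tokens, feat_dim] tensors, one per encoder.
        pooled = torch.cat([f.mean(dim=1) for f in encoder_feats], dim=-1)
        weights = F.softmax(self.gate(pooled), dim=-1)            # [batch, num_encoders]
        stacked = torch.stack(encoder_feats, dim=1)               # [batch, E, tokens, dim]
        return (weights[:, :, None, None] * stacked).sum(dim=1)   # modulated fusion


class MixtureOfLanguageExperts(nn.Module):
    """Sparsely gated FFN experts: each token is routed to its top-1 expert,
    so roughly one expert's worth of FFN compute is spent per token."""

    def __init__(self, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, tokens, hidden_dim]
        logits = self.router(hidden)          # [batch, tokens, num_experts]
        top1 = logits.argmax(dim=-1)          # hard top-1 routing per token
        out = torch.zeros_like(hidden)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out


if __name__ == "__main__":
    feats = [torch.randn(2, 16, 64) for _ in range(3)]  # three mock vision encoders
    fused = MixtureOfVisionExperts(num_encoders=3, feat_dim=64)(feats)
    tokens = MixtureOfLanguageExperts(hidden_dim=64)(torch.randn(2, 10, 64))
    print(fused.shape, tokens.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 10, 64])
```

The top-1 routing in the language experts is what the abstract's "roughly unchanged inference costs" claim corresponds to in spirit: adding experts grows parameters, but each token still passes through only one FFN-sized expert.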
Author | Shen, Leyang; Chen, Gongwei; Shao, Rui; Guan, Weili; Nie, Liqiang |
ContentType | Journal Article |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2407.12709 |
DatabaseName | arXiv Computer Science; arXiv.org |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://arxiv.org/abs/2407.12709 |
PublicationDate | 2024-07-17 |
PublicationYear | 2024 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Computer Vision and Pattern Recognition |
Title | MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models |
URI | https://arxiv.org/abs/2407.12709 |
linkProvider | Cornell University |