MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Bibliographic Details
Main Authors: Shen, Leyang; Chen, Gongwei; Shao, Rui; Guan, Weili; Nie, Liqiang
Format: Journal Article (preprint)
Language: English
Published: 2024-07-17
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.12709 (DOI: 10.48550/arXiv.2407.12709)
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components, a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders, and has a strong compatibility in transformation architecture. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both vision and language modality to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME
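
Note: The two mechanisms named in the abstract can be pictured with a minimal PyTorch-style sketch. This is not the authors' implementation (that is available at the linked repository); the module names, tensor shapes, instance-level gating, and top-1 routing below are illustrative assumptions only.

# Minimal sketch of the two ideas the abstract describes (assumed details):
#  - "MoVE"-style adaptive modulation: per-instance gating over features
#    produced by several vision encoders, already projected to a common dim.
#  - "MoLE"-style sparsely gated experts: a top-1 router sends each token
#    through one small FFN expert, keeping inference cost roughly flat.
import torch
import torch.nn as nn


class AdaptiveVisionMixture(nn.Module):
    """Weights and sums pre-projected features from several vision encoders."""

    def __init__(self, num_encoders: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_encoders)  # one score per encoder, per instance

    def forward(self, encoder_feats: list[torch.Tensor]) -> torch.Tensor:
        # encoder_feats: list of [batch, tokens, dim] tensors, one per encoder
        stacked = torch.stack(encoder_feats, dim=1)          # [B, E, T, D]
        summary = stacked.mean(dim=(2,))                     # [B, E, D] pooled over tokens
        scores = self.gate(summary.mean(dim=1))              # [B, E]
        weights = scores.softmax(dim=-1)[:, :, None, None]   # [B, E, 1, 1]
        return (weights * stacked).sum(dim=1)                # [B, T, D]


class SparseFFNExperts(nn.Module):
    """Top-1 sparsely gated FFN experts: each token is processed by one expert."""

    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, dim]
        flat = x.reshape(-1, x.shape[-1])                    # [B*T, D]
        top1 = self.router(flat).argmax(dim=-1)              # [B*T] chosen expert index
        out = torch.zeros_like(flat)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = expert(flat[mask])               # only that expert's tokens
        return out.reshape_as(x)


if __name__ == "__main__":
    feats = [torch.randn(2, 16, 64) for _ in range(3)]       # 3 hypothetical encoders
    fused = AdaptiveVisionMixture(num_encoders=3, dim=64)(feats)
    routed = SparseFFNExperts(dim=64, hidden=128, num_experts=4)(fused)
    print(fused.shape, routed.shape)                         # both torch.Size([2, 16, 64])

Because only one expert runs per token, the per-token compute matches a single FFN regardless of the number of experts, which is the sense in which the abstract says inference costs stay roughly unchanged.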