MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Bibliographic Details
Main Authors: Shen, Leyang; Chen, Gongwei; Shao, Rui; Guan, Weili; Nie, Liqiang
Format: Journal Article (preprint)
Language: English
Published: 2024-07-17
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.12709 (DOI: 10.48550/arXiv.2407.12709)
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, a generalist MLLM typically underperforms compared with a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. Our MoME is composed of two key components, a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE can adaptively modulate the features transformed from various vision encoders, and has a strong compatibility in transformation architecture. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, our MoME specializes in both vision and language modality to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME
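
Note: The two mechanisms named in the abstract can be pictured with a minimal PyTorch-style sketch. This is not the authors' implementation (that is available at the linked repository); the module names, tensor shapes, instance-level gating, and top-1 routing below are illustrative assumptions only.

# Minimal sketch of the two ideas the abstract describes (assumed details):
#  - "MoVE"-style adaptive modulation: per-instance gating over features
#    produced by several vision encoders, already projected to a common dim.
#  - "MoLE"-style sparsely gated experts: a top-1 router sends each token
#    through one small FFN expert, keeping inference cost roughly flat.
import torch
import torch.nn as nn


class AdaptiveVisionMixture(nn.Module):
    """Weights and sums pre-projected features from several vision encoders."""

    def __init__(self, num_encoders: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_encoders)  # one score per encoder, per instance

    def forward(self, encoder_feats: list[torch.Tensor]) -> torch.Tensor:
        # encoder_feats: list of [batch, tokens, dim] tensors, one per encoder
        stacked = torch.stack(encoder_feats, dim=1)          # [B, E, T, D]
        summary = stacked.mean(dim=(2,))                     # [B, E, D] pooled over tokens
        scores = self.gate(summary.mean(dim=1))              # [B, E]
        weights = scores.softmax(dim=-1)[:, :, None, None]   # [B, E, 1, 1]
        return (weights * stacked).sum(dim=1)                # [B, T, D]


class SparseFFNExperts(nn.Module):
    """Top-1 sparsely gated FFN experts: each token is processed by one expert."""

    def __init__(self, dim: int, hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, dim]
        flat = x.reshape(-1, x.shape[-1])                    # [B*T, D]
        top1 = self.router(flat).argmax(dim=-1)              # [B*T] chosen expert index
        out = torch.zeros_like(flat)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = expert(flat[mask])               # only that expert's tokens
        return out.reshape_as(x)


if __name__ == "__main__":
    feats = [torch.randn(2, 16, 64) for _ in range(3)]       # 3 hypothetical encoders
    fused = AdaptiveVisionMixture(num_encoders=3, dim=64)(feats)
    routed = SparseFFNExperts(dim=64, hidden=128, num_experts=4)(fused)
    print(fused.shape, routed.shape)                         # both torch.Size([2, 16, 64])

Because only one expert runs per token, the per-token compute matches a single FFN regardless of the number of experts, which is the sense in which the abstract says inference costs stay roughly unchanged.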