Quick Read: MM1 Study: Elevating Multimodal LLMs through Pre-training Innovations
✨ A very insightful study introducing a family of #SOTA-performant #MLLMs.
Post #SFT, the model family delivers competitive performance while enabling #multi_image #reasoning and #few_shot_prompting.
📌 After ablations across model architecture decisions and pre-training data choices, these are the main observations/learnings/highlights I picked up during my quick read:
✏ #MM1 includes both dense models up to 30B parameters and #MoE variants
✏ The image encoder, image resolution, and visual token count have a substantial impact on performance, more than model size (see the token-count sketch after this list)
✏ The specific vision-language connector design has comparatively little effect; the number of visual tokens and the image resolution matter far more.
✏ The data decision to mix interleaved image-text, caption, and text-only data during pre-training is crucial
✏ A 5:5:1 ratio of caption, interleaved, and text-only data works best (see the mixture sketch after this list)
✏ High-quality synthetic caption data gives a boost in few-shot performance.
✏ Post supervised fine-tuning, MM1 achieves competitive SOTA results on VQA and captioning tasks across multimodal benchmarks, including in few-shot settings.
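
To make the resolution/token-count point concrete, here is a minimal sketch of how visual token count grows with image resolution for a ViT-style encoder with square patches. The patch size and resolutions below are illustrative assumptions, not MM1's exact configuration.

```python
# Minimal sketch: how the number of visual tokens scales with image resolution
# for a ViT-style encoder. Patch size and resolutions are illustrative only.

def visual_token_count(image_resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens produced for a square image."""
    patches_per_side = image_resolution // patch_size
    return patches_per_side * patches_per_side

if __name__ == "__main__":
    for res in (224, 336, 448):
        print(f"{res}px -> {visual_token_count(res)} visual tokens")
```

Higher resolution quadratically increases the token count the LLM must attend over, which is why resolution and token count dominate the compute/performance trade-off more than the connector design.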
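
And a minimal sketch of what a 5:5:1 caption : interleaved : text-only mixture could look like as a weighted sampler. The source names and sampling scheme are assumptions for illustration, not MM1's actual data pipeline.

```python
# Minimal sketch of a 5:5:1 pre-training data mixture
# (caption : interleaved image-text : text-only). Illustrative only.
import random

MIXTURE_WEIGHTS = {"caption": 5, "interleaved": 5, "text_only": 1}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from."""
    sources = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {s: 0 for s in MIXTURE_WEIGHTS}
    for _ in range(11_000):
        counts[sample_source(rng)] += 1
    print(counts)  # roughly 5000 / 5000 / 1000
```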