Quick Read: MM1 Study: Elevating Multimodal LLMs through Pre-training Innovations
✨ A very insightful study introducing a family of #SOTA-performant #MLLMs.
Post #SFT, the model family delivers competitive performance while enabling #multi_image #reasoning and #few_shot_prompting.
📌 After ablations across model architecture decisions and pre-training data choices, these are the main observations/learnings/highlights I picked up during my quick read:
✏ #MM1 includes both dense models up to 30B parameters and #MoE variants
✏ The image encoder, image resolution, and visual token count have a substantial impact on performance, more than model size (see the token-count sketch after this list)
✏ The specific vision-language connector design has comparatively little effect; the number of visual tokens and the image resolution matter far more.
✏ The data decision to mix interleaved image-text, caption, and text-only data during pre-training is crucial
✏ A 5:5:1 ratio of caption, interleaved, and text-only data works best (see the mixture sketch after this list)
✏ High-quality synthetic caption data gives a boost in few-shot performance.
✏ Post supervised fine-tuning, MM1 achieves competitive SOTA results on VQA and captioning tasks across multimodal benchmarks, including in few-shot settings.
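
To make the resolution/token-count point concrete, here is a minimal sketch of how visual token count grows with image resolution for a ViT-style encoder with square patches. The patch size and resolutions below are illustrative assumptions, not MM1's exact configuration.

```python
# Minimal sketch: how the number of visual tokens scales with image resolution
# for a ViT-style encoder. Patch size and resolutions are illustrative only.

def visual_token_count(image_resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens produced for a square image."""
    patches_per_side = image_resolution // patch_size
    return patches_per_side * patches_per_side

if __name__ == "__main__":
    for res in (224, 336, 448):
        print(f"{res}px -> {visual_token_count(res)} visual tokens")
```

Higher resolution quadratically increases the token count the LLM must attend over, which is why resolution and token count dominate the compute/performance trade-off more than the connector design.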
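
And a minimal sketch of what a 5:5:1 caption : interleaved : text-only mixture could look like as a weighted sampler. The source names and sampling scheme are assumptions for illustration, not MM1's actual data pipeline.

```python
# Minimal sketch of a 5:5:1 pre-training data mixture
# (caption : interleaved image-text : text-only). Illustrative only.
import random

MIXTURE_WEIGHTS = {"caption": 5, "interleaved": 5, "text_only": 1}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from."""
    sources = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {s: 0 for s in MIXTURE_WEIGHTS}
    for _ in range(11_000):
        counts[sample_source(rng)] += 1
    print(counts)  # roughly 5000 / 5000 / 1000
```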