Fine-tuning a Multimodal Model Using SFT (Single or Multi-Image Dataset)
Key Points
- This guide details the process of fine-tuning multimodal language models, specifically Gemma 3, using Supervised Fine-Tuning (SFT) within the TRL library.
- It covers two scenarios: single-image + text datasets (like LLaVA Instruct Mix) and multi-image + text datasets (like MMIU-Benchmark), emphasizing the necessary data preprocessing and conversational formatting for multi-image inputs.
- The fine-tuning procedure involves setting up the environment, loading datasets, preparing the model with BitsAndBytes and QLoRA, configuring training arguments with SFTConfig, and implementing a custom `collate_fn` to handle and mask multimodal inputs.
This document details the process of fine-tuning multimodal language models, specifically exemplified with Gemma 3, using Supervised Fine-Tuning (SFT) within the TRL library. The guide addresses two distinct use cases: fine-tuning with single image and text data, and fine-tuning with multi-image and text data (interleaving).
For the single image + text scenario, the HuggingFaceH4/llava-instruct-mix-vsft dataset is utilized. This dataset comprises conversations where a user provides a single image and text, and the assistant responds based on both modalities. For multi-image + text, the FanqingM/MMIU-Benchmark dataset is employed, which features a context, a question, a series of related images, and an expected answer, requiring the model to reason over multiple visual inputs.
The fine-tuning methodology involves several key steps:
- Environment Setup: Dependencies such as `trl`, `bitsandbytes`, `peft`, `hf_xet`, and `tensorboard` are installed. Access to gated models like Gemma 3 requires logging into the Hugging Face Hub.
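As a sketch, assuming a standard pip workflow, the setup step might look like the following (package list taken from the guide; versions are not pinned here):

```shell
# Install the libraries used throughout the guide
pip install -U trl bitsandbytes peft hf_xet tensorboard

# Authenticate to access gated models such as Gemma 3
huggingface-cli login
```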
- Data Loading and Preprocessing:
  - Single Image + Text: The `HuggingFaceH4/llava-instruct-mix-vsft` dataset is loaded directly with `datasets.load_dataset`, as it is already formatted for SFT.
  - Multi-Image + Text: The raw `FanqingM/MMIU-Benchmark` dataset is loaded and then preprocessed to convert the raw data (including zipped image files) into a conversational format suitable for multimodal models. This involves extracting images from ZIP archives, converting them to `PIL.Image.Image` objects, and structuring each sample as a list of message dictionaries. Each message includes a `role` (e.g., "system", "user", "assistant") and a `content` list whose entries are objects of `type: "text"` or `type: "image"` carrying the actual image data. An example conversational structure:
```
[
  {
    "role": "system",
    "content": [{"type": "text", "text": "..."}]
  },
  {
    "role": "user",
    "content": [<images_list>, {"type": "text", "text": "..."}]
  },
  {
    "role": "assistant",
    "content": [{"type": "text", "text": "..."}]
  }
]
```

The `prepare_dataset` function handles downloading and extracting the image archives, while `format_data` structures the text and image inputs into the required conversational format.
- Model and Processor Preparation: The `google/gemma-3-4b-it` model is loaded using `transformers.AutoModelForImageTextToText` together with its corresponding `AutoProcessor`. To optimize memory usage, `BitsAndBytesConfig` is employed for 4-bit quantization. The `attn_implementation` is set to `"eager"`, and the tokenizer's `padding_side` is set to `"right"`.
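A minimal loading sketch follows. The guide only states that 4-bit quantization is used, so the specific `BitsAndBytesConfig` values below are common QLoRA-style defaults, not necessarily the guide's exact settings:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

# Assumed 4-bit quantization settings; illustrative values only.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",
    quantization_config=bnb_config,
    attn_implementation="eager",  # as stated in the guide
)
processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
processor.tokenizer.padding_side = "right"
```

Note that loading this gated model requires prior authentication with the Hugging Face Hub.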
- QLoRA Configuration: Quantized Low-Rank Adaptation (QLoRA) is configured via `peft.LoraConfig`. Key hyperparameters include the LoRA rank `r`, `lora_alpha`, `lora_dropout`, `target_modules`, `bias`, and `task_type`. Additionally, `modules_to_save` is specified so that essential modules (e.g., the embeddings and language-model head) are trained fully rather than quantized or adapted.
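A configuration sketch, assuming illustrative hyperparameter values (the guide names the parameters, but the concrete numbers and module names below are assumptions):

```python
from peft import LoraConfig

# Illustrative QLoRA hyperparameters; values are assumptions, not the guide's.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head", "embed_tokens"],  # keep these fully trained
)
```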
- Training Arguments: `trl.SFTConfig` defines training parameters such as `output_dir` (e.g., "gemma-3-4b-it-trl-sft-llava-instruct-mix-vsft"), `per_device_train_batch_size` (8 for single-image, 1 for multi-image), and `gradient_accumulation_steps` (4 for single-image, 1 for multi-image), along with the learning rate, number of epochs, gradient checkpointing, precision, logging, and checkpointing settings.
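A sketch of the single-image configuration; only the batch-size, accumulation, and output-directory values come from the text above, and the remaining fields are illustrative assumptions:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="gemma-3-4b-it-trl-sft-llava-instruct-mix-vsft",
    per_device_train_batch_size=8,   # 1 for the multi-image dataset
    gradient_accumulation_steps=4,   # 1 for the multi-image dataset
    gradient_checkpointing=True,
    learning_rate=2e-4,              # assumed value
    num_train_epochs=1,              # assumed value
    logging_steps=10,                # assumed value
    bf16=True,                       # assumed precision setting
    report_to="tensorboard",
    remove_unused_columns=False,     # keep image columns for the custom collator
)
```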
- Data Collator (`collate_fn`): This crucial function prepares batches for training:
  - It applies the chat template to convert message lists into raw text strings.
  - It extracts images, converting them to RGB format. In the multi-image case, the `process_vision_info` helper extracts `PIL.Image.Image` objects from the potentially nested `messages` structure.
  - It tokenizes both texts and images using the `processor`, returning PyTorch tensors with padding.
  - It sets `labels` to a clone of `input_ids`.
  - Crucially, it masks specific tokens by setting their corresponding label entries to -100, ensuring they are ignored during loss computation:
    - `processor.tokenizer.pad_token_id`
    - `processor.tokenizer.convert_tokens_to_ids(processor.tokenizer.special_tokens_map["boi_token"])` (the beginning-of-image token ID)
    - the hardcoded ID 262144 (the image soft token).
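The masking rule at the heart of `collate_fn` can be illustrated with plain Python lists (the real implementation operates on PyTorch tensors produced by the processor; the token IDs below are stand-ins, except 262144, which is the hardcoded ID from the guide):

```python
# Stand-in token IDs for illustration; only 262144 comes from the guide,
# the other two are hypothetical values.
PAD_TOKEN_ID = 0              # stand-in for processor.tokenizer.pad_token_id
BOI_TOKEN_ID = 255999         # stand-in for the "boi_token" ID
IMAGE_SOFT_TOKEN_ID = 262144  # hardcoded image token ID from the guide

IGNORED_IDS = {PAD_TOKEN_ID, BOI_TOKEN_ID, IMAGE_SOFT_TOKEN_ID}

def mask_labels(input_ids):
    """Clone input_ids into labels, replacing ignored token IDs with -100."""
    return [-100 if tok in IGNORED_IDS else tok for tok in input_ids]

print(mask_labels([2, 106, 255999, 262144, 262144, 4521, 9, 0]))
# → [2, 106, -100, -100, -100, 4521, 9, -100]
```

Tokens mapped to -100 contribute nothing to the cross-entropy loss, so the model is never trained to predict padding or image-placeholder positions.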
- Training Execution: The `trl.SFTTrainer` is instantiated with the model, training arguments, data collator, training dataset (`dataset["train"]` for single-image, `dataset["test"]` for multi-image), processor, and PEFT configuration. The `trainer.train()` method then initiates the fine-tuning process. After training, `trainer.save_model()` saves the fine-tuned model.
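Wiring these pieces together might look like the following sketch, assuming `model`, `training_args`, `collate_fn`, `dataset`, `processor`, and `peft_config` are defined as described above:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=dataset["train"],  # dataset["test"] for the multi-image case
    processing_class=processor,      # keyword name assumed for recent TRL versions
    peft_config=peft_config,
)

trainer.train()
trainer.save_model()
```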
The process leverages Hugging Face ecosystem tools for efficient multimodal model fine-tuning: `transformers` for model and processor handling, `bitsandbytes` for quantization, `peft` for LoRA, and `trl` for the SFT training loop. Results are automatically logged to tools like TensorBoard or Weights & Biases. The document also notes existing limitations specific to Gemma fine-tuning.