Fine-Tuning an Existing VLM vs. Creating a Custom Pipeline

Hi everyone 👋,

I’m working on a Vision-Language Model (VLM) project that processes images containing both text and visual elements and produces meaningful outputs, such as explanations or answers to questions.

I’m considering two approaches:

  1. Fine-tuning an existing VLM to adapt it to my specific requirements (roughly the setup sketched below).
  2. Building a custom pipeline.
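
For option 1, this is roughly the parameter-efficient setup I have in mind. It's only a minimal sketch that assumes a Hugging Face checkpoint (Idefics2 here as a placeholder) with LoRA adapters via `peft`; the `target_modules` names are guesses and would need to match whichever architecture I end up using:

```python
# Minimal LoRA fine-tuning sketch (assumptions: Idefics2 checkpoint, attention
# projection names "q_proj"/"v_proj"; adjust for whichever VLM is chosen).
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceM4/idefics2-8b"  # placeholder checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# LoRA freezes the base weights and trains small adapter matrices,
# which keeps memory and data requirements low.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # guess; inspect model.named_modules() to confirm
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable
```

Actual training would then go through the usual `transformers` Trainer or a custom loop over image/text pairs formatted by the processor.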

Key considerations:

  • The task involves handling mixed content, including text and diagrams.
  • Resources are limited.

What would be the best approach to achieve a robust and efficient solution? Any advice on models, datasets, or fine-tuning strategies would be greatly appreciated!