Fine-Tuning an Existing VLM vs. Creating a Custom Pipeline

Hi everyone 👋,

I’m working on a Vision-Language Model (VLM) project that processes images containing both text and visual elements and produces meaningful outputs, such as explanations or answers to questions.

I’m considering two approaches:

  1. Fine-tuning an existing VLM to adapt it to my specific requirements (roughly the setup sketched below).
  2. Building a custom pipeline.
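
For option 1, this is roughly the parameter-efficient setup I have in mind. It's only a minimal sketch that assumes a Hugging Face checkpoint (Idefics2 here as a placeholder) with LoRA adapters via `peft`; the `target_modules` names are guesses and would need to match whichever architecture I end up using:

```python
# Minimal LoRA fine-tuning sketch (assumptions: Idefics2 checkpoint, attention
# projection names "q_proj"/"v_proj"; adjust for whichever VLM is chosen).
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceM4/idefics2-8b"  # placeholder checkpoint

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# LoRA freezes the base weights and trains small adapter matrices,
# which keeps memory and data requirements low.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # guess; inspect model.named_modules() to confirm
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable
```

Actual training would then go through the usual `transformers` Trainer or a custom loop over image/text pairs formatted by the processor.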

Key considerations:

  • The task involves handling mixed content, including text and diagrams.
  • Resources are limited.

What would be the best approach to achieve a robust and efficient solution? Any advice on models, datasets, or fine-tuning strategies would be greatly appreciated!