Hi everyone,
I’m working on a Vision-Language Model (VLM) project: given images that combine text and visual elements, the system should produce meaningful outputs such as explanations or answers to questions about the image.
I’m considering two approaches:
- Fine-tuning an existing VLM to adapt it to my specific requirements, most likely with a parameter-efficient method such as LoRA given my constraints (rough sketch below).
- Building a custom pipeline from separate components, e.g., OCR for the embedded text plus a text model for the reasoning (second sketch below).
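For context, here is roughly what I have in mind for the fine-tuning option. This is only a minimal sketch assuming the Hugging Face transformers and peft libraries; the model ID and the target_modules names are placeholders that would need to match whichever VLM I end up using:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

# Placeholder model -- any VLM loadable via AutoModelForVision2Seq should work similarly.
model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float16)

# LoRA trains small adapter matrices instead of the full model,
# which is the main reason I'm considering it on limited hardware.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```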
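And this is the kind of thing I mean by a custom pipeline: OCR to pull out the embedded text, then a plain text model for the question answering. Again just a sketch under assumptions (Tesseract via pytesseract, and flan-t5-base as a stand-in model); it obviously ignores the purely visual parts of a diagram, which is one of my doubts about this route:

```python
from PIL import Image
import pytesseract
from transformers import pipeline

# Stand-in text model; any instruction-following text model could slot in here.
qa = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_about_image(image_path: str, question: str) -> str:
    # Step 1: OCR the image to recover labels, captions, and other embedded text.
    extracted = pytesseract.image_to_string(Image.open(image_path))
    # Step 2: hand the OCR text plus the question to the text model.
    prompt = f"Context from image: {extracted}\nQuestion: {question}"
    return qa(prompt, max_new_tokens=128)[0]["generated_text"]
```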
Key considerations:
- The task involves handling mixed content, including text and diagrams.
- Resources are limited.
What would be the best approach to achieve a robust and efficient solution? Any advice on models, datasets, or fine-tuning strategies would be greatly appreciated!