Hey everyone,
I’ve been experimenting with some LLMs lately, and I’m super curious about combining Whisper Large v3 with models like Qwen or InternVL3 into a single model that can handle audio input directly.
A few questions I’d love to get input on:
- Has anyone tried something like connecting Whisper to Qwen or to a multimodal model like InternVL3?
- Would this need fine-tuning or some kind of adapter/bridge?
- Any working examples, repos, or code snippets that show how to connect Whisper to a text-based or multimodal LLM?
If anyone could share a basic implementation example or even a sketch of the architecture, that’d be awesome. I’m trying to learn how this pipeline could work practically.
I would start out with API-based transcription first, like gpt-4o-transcribe, gpt-4o-mini-transcribe, or whisper-1, just to get things going.
Personally I find whisper-1 to be the most accurate of the three, even though it’s the oldest model.
There is no fine-tuning or anything special needed. Just follow the docs; it should be super easy, maybe 10 minutes of work to get the data flowing.
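For reference, the API call is roughly this in Python (a minimal sketch; the file name is just a placeholder and you need an OPENAI_API_KEY in your environment):

```python
# Minimal sketch: transcribe an audio file through the OpenAI API.
# Assumes OPENAI_API_KEY is set and "speech.mp3" is your audio file (placeholder name).
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # or "gpt-4o-transcribe" / "gpt-4o-mini-transcribe"
        file=audio_file,
    )

print(transcript.text)
```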
After this, if you want to use the HuggingFace version, there is a bit more work: you’d need to decide how you want to run it, local or cloud, and on what hardware. If you have low volume, you could run it on one of their A100s for cheap (around 9 bucks a month). I haven’t tried that with Whisper myself, but I have run other models on cloud A100s.
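If you go the HuggingFace route, the transformers ASR pipeline is probably the quickest way to try large-v3. Something like this sketch (the audio file name is a placeholder; hardware/dtype choices are just one reasonable setup):

```python
# Minimal sketch: run Whisper large-v3 locally or on a rented GPU with transformers.
# Assumes torch and transformers are installed and "speech.mp3" exists (placeholder name).
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# return_timestamps=True lets the pipeline handle clips longer than 30 seconds.
result = asr("speech.mp3", return_timestamps=True)
print(result["text"])
```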
Yeah, whisper-1 might be good, but I want to use Large-v3. What I want is to combine Whisper’s parameters with Qwen or another LLM (in other words, make it multimodal).
Merging across modalities and architectures is hard, and you would inevitably need to do some model surgery.
If you had the same LLM architecture (text-only LLMs, I believe), you could try mergekit.
But here, with different architectures and modalities, you’ll have to do your own research and maybe vibe-code your way towards a solution; I’m not aware of any magical library sitting on the shelf for this.
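To give you an idea of what that surgery usually looks like: the common recipe is to keep the Whisper encoder frozen, add a small trainable projector into the LLM’s embedding space, and feed the projected audio states into the LLM as extra “tokens”. Below is only an illustrative, untested sketch of that idea, not a ready-made solution; the model IDs and the single-linear-layer projector are my assumptions, and the projector would still have to be trained on paired audio-text data:

```python
# Illustrative sketch: bridge a frozen Whisper encoder to a frozen LLM via a trainable projector.
# Model names and projector design are assumptions; training code is not shown.
import torch
import torch.nn as nn
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperModel,
)

class AudioBridge(nn.Module):
    """Frozen Whisper encoder -> trainable projector -> LLM embedding space."""

    def __init__(self, whisper_id="openai/whisper-large-v3", llm_id="Qwen/Qwen2.5-7B-Instruct"):
        super().__init__()
        self.feature_extractor = WhisperFeatureExtractor.from_pretrained(whisper_id)
        self.encoder = WhisperModel.from_pretrained(whisper_id).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.bfloat16)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_id)

        # Freeze both pretrained models; only the projector would be trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

        # Map Whisper's hidden size (1280 for large-v3) to the LLM's hidden size.
        self.projector = nn.Linear(self.encoder.config.d_model, self.llm.config.hidden_size)

    def forward(self, audio_array, sampling_rate, prompt):
        # 1) Audio -> log-mel features -> Whisper encoder states.
        feats = self.feature_extractor(
            audio_array, sampling_rate=sampling_rate, return_tensors="pt"
        ).input_features
        audio_states = self.encoder(feats).last_hidden_state  # (1, 1500, 1280)

        # 2) Project audio states into the LLM's embedding space.
        audio_embeds = self.projector(audio_states).to(self.llm.dtype)

        # 3) Embed the text prompt and prepend the projected audio "tokens".
        text_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        text_embeds = self.llm.get_input_embeddings()(text_ids)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)

        # 4) Generate conditioned on audio + text; decode the result with the tokenizer.
        return self.llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
```

The projector is the only trainable piece here, which is exactly why this kind of bridge still needs some fine-tuning on paired audio-text data; it can’t be done as a pure parameter merge.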