Hey everyone,
I’ve been experimenting with some LLMs lately, and I’m super curious about combining Whisper Large v3 with models like Qwen or InternVL3 into a single model that can handle audio input directly.
A few questions I’d love to get input on:
- Has anyone tried something like connecting Whisper to Qwen or to a multimodal model like InternVL3?
- Would this need fine-tuning or some kind of adapter/bridge?
- Any working examples, repos, or code snippets that show how to connect Whisper to a text-based or multimodal LLM?
If anyone could share a basic implementation example or even a sketch of the architecture, that’d be awesome. I’m trying to learn how this pipeline could work practically.
I would start out with API-based transcription first, like gpt-4o-transcribe, gpt-4o-mini-transcribe, or whisper-1, just to get things going.
Personally I find whisper-1 to be the most accurate of the three, even though it’s the oldest model.
There is no fine-tuning or anything special needed. Just follow the docs; it should be super easy, maybe 10 minutes of work to get the data flowing.
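For reference, the API call is roughly this in Python (a minimal sketch; the file name is just a placeholder and you need an OPENAI_API_KEY in your environment):

```python
# Minimal sketch: transcribe an audio file through the OpenAI API.
# Assumes OPENAI_API_KEY is set and "speech.mp3" is your audio file (placeholder name).
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # or "gpt-4o-transcribe" / "gpt-4o-mini-transcribe"
        file=audio_file,
    )

print(transcript.text)
```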
After this, if you want to use the HuggingFace version, there is a bit more work: you’d need to decide how you want to run it, local or cloud, and on what hardware. If you have low volume, you could run it on one of their A100s for cheap (around 9 bucks a month). I haven’t tried that with Whisper myself, but I have run other models on cloud A100s.
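If you go the HuggingFace route, the transformers ASR pipeline is probably the quickest way to try large-v3. Something like this sketch (the audio file name is a placeholder; hardware/dtype choices are just one reasonable setup):

```python
# Minimal sketch: run Whisper large-v3 locally or on a rented GPU with transformers.
# Assumes torch and transformers are installed and "speech.mp3" exists (placeholder name).
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# return_timestamps=True lets the pipeline handle clips longer than 30 seconds.
result = asr("speech.mp3", return_timestamps=True)
print(result["text"])
```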
Yeah, whisper-1 might be good, but I want to use Large-v3. What I want is to combine Whisper’s parameters with Qwen or another LLM (in other words, make it multimodal).
Merging across modalities and architectures is hard, and you would inevitably need to do some model surgery.
If you had the same LLM architecture (text-only LLMs, I believe), you could try mergekit.
But here, with different architectures and modalities, you’ll have to do your own research and maybe vibe-code your way towards a solution; I’m not aware of any magical library sitting on the shelf for this.
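To give you an idea of what that surgery usually looks like: the common recipe is to keep the Whisper encoder frozen, add a small trainable projector into the LLM’s embedding space, and feed the projected audio states into the LLM as extra “tokens”. Below is only an illustrative, untested sketch of that idea, not a ready-made solution; the model IDs and the single-linear-layer projector are my assumptions, and the projector would still have to be trained on paired audio-text data:

```python
# Illustrative sketch: bridge a frozen Whisper encoder to a frozen LLM via a trainable projector.
# Model names and projector design are assumptions; training code is not shown.
import torch
import torch.nn as nn
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperModel,
)

class AudioBridge(nn.Module):
    """Frozen Whisper encoder -> trainable projector -> LLM embedding space."""

    def __init__(self, whisper_id="openai/whisper-large-v3", llm_id="Qwen/Qwen2.5-7B-Instruct"):
        super().__init__()
        self.feature_extractor = WhisperFeatureExtractor.from_pretrained(whisper_id)
        self.encoder = WhisperModel.from_pretrained(whisper_id).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.bfloat16)
        self.tokenizer = AutoTokenizer.from_pretrained(llm_id)

        # Freeze both pretrained models; only the projector would be trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

        # Map Whisper's hidden size (1280 for large-v3) to the LLM's hidden size.
        self.projector = nn.Linear(self.encoder.config.d_model, self.llm.config.hidden_size)

    def forward(self, audio_array, sampling_rate, prompt):
        # 1) Audio -> log-mel features -> Whisper encoder states.
        feats = self.feature_extractor(
            audio_array, sampling_rate=sampling_rate, return_tensors="pt"
        ).input_features
        audio_states = self.encoder(feats).last_hidden_state  # (1, 1500, 1280)

        # 2) Project audio states into the LLM's embedding space.
        audio_embeds = self.projector(audio_states).to(self.llm.dtype)

        # 3) Embed the text prompt and prepend the projected audio "tokens".
        text_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        text_embeds = self.llm.get_input_embeddings()(text_ids)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)

        # 4) Generate conditioned on audio + text; decode the result with the tokenizer.
        return self.llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
```

The projector is the only trainable piece here, which is exactly why this kind of bridge still needs some fine-tuning on paired audio-text data; it can’t be done as a pure parameter merge.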