A simple hack to make LLMs listen to audio rather than only read it

Hey folks :waving_hand:

I recently wrote a blog post on how LLMs can "listen" to audio rather than just read transcripts, by pairing Whisper outputs with pitch, RMS energy, and other acoustic features.
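To make the idea concrete, here is a minimal sketch of extracting the two features mentioned above (frame-level RMS energy and a pitch estimate) using only NumPy. In practice a library like librosa (`librosa.feature.rms`, `librosa.pyin`) would do this more robustly; the naive autocorrelation pitch tracker below is just for illustration.

```python
import numpy as np

def frame_rms(y: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    # Root-mean-square energy per frame: a rough loudness proxy.
    frames = [y[i:i + frame_len] for i in range(0, len(y) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def pitch_autocorr(y: np.ndarray, sr: int, fmin: float = 60, fmax: float = 500) -> float:
    # Naive pitch estimate: pick the autocorrelation peak within the
    # lag range corresponding to plausible speech pitch (fmin..fmax Hz).
    ac = np.correlate(y, y, mode="full")[len(y) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# Synthetic 220 Hz tone as a stand-in for a voiced speech segment.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)
print(pitch_autocorr(y, sr))        # roughly 220 Hz
print(float(frame_rms(y).mean()))   # roughly 0.354 (= 0.5 / sqrt(2))
```

Per-segment stats like these (mean pitch, pitch variance, energy contour) can then be serialized into the prompt alongside the Whisper transcript.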

:link: Read it here: LLMs Meet Audio: Teaching AI to Hear Emotion, Not Just Read It

Would love to get feedback from others working with Whisper, embeddings, or sentiment from speech!

What if we used a more capable model to gauge the speaker's mood directly from the original audio sample, then used that signal to refine the baseline annotation?

This could improve accuracy while keeping costs low.
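That two-stage refinement could be sketched as follows. All names here are hypothetical: I'm assuming the baseline annotator and the audio mood model each return a label plus a confidence score, and the rule for reconciling them is one illustrative choice among many.

```python
def refine_annotation(text_label: str, text_conf: float,
                      audio_mood: str, audio_conf: float) -> str:
    # Hypothetical reconciliation rule: trust the audio-based mood model
    # only when it is confident AND disagrees with the text baseline;
    # otherwise keep the cheaper baseline label.
    if audio_mood != text_label and audio_conf > max(text_conf, 0.7):
        return audio_mood
    return text_label

# Text alone reads "neutral", but the audio model confidently hears frustration.
print(refine_annotation("neutral", 0.55, "frustrated", 0.85))  # frustrated
# A confident text baseline is kept when the audio model is unsure.
print(refine_annotation("happy", 0.90, "sad", 0.60))           # happy
```

Since the expensive audio model only overrides the baseline on confident disagreements, most samples stay on the cheap path.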