Whisper and function calling for navigating mobile GUIs (Android, iOS, Flutter)

I’ve written an article about using function calling for mobile assistance.

It mostly focuses on interpreting natural language in connection with the GUI, but the text first has to come from a speech recognizer, and that is the main bottleneck of the approach. I’ve tried Whisper. It is pretty good, but not so good at names, for instance. That may also be because I use it in Dutch, which it may know less well. I noticed that the Google speech recognizer is much better at recognizing street addresses, for instance (the default model is also a lot better than Chirp). I’ve also tried Deepgram: despite their claims, it was neither accurate nor fast. I’m currently exploring model adaptation for Google speech recognition.
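
One cheap form of biasing that might be worth comparing against model adaptation is feeding known names into Whisper's `initial_prompt`. Below is a minimal sketch, assuming the openai-whisper Python package and a hypothetical list of on-screen labels or contact names. The helper names (`build_bias_phrases`, `to_initial_prompt`, `transcribe_with_hints`), the model size, and the example data are all made up for illustration, and whether this actually helps for Dutch names is untested here.

```python
from typing import Iterable, List


def build_bias_phrases(labels: Iterable[str], max_phrases: int = 50) -> List[str]:
    """Deduplicate candidate names case-insensitively, keeping first-seen order."""
    seen = set()
    phrases = []
    for label in labels:
        cleaned = " ".join(label.split())  # trim and collapse whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            phrases.append(cleaned)
        if len(phrases) >= max_phrases:
            break
    return phrases


def to_initial_prompt(phrases: List[str]) -> str:
    """Join the phrases into one comma-separated prompt string."""
    return ", ".join(phrases)


def transcribe_with_hints(audio_path: str, phrases: List[str]) -> str:
    """Transcribe Dutch audio, passing the phrases as Whisper's initial_prompt."""
    import whisper  # openai-whisper; imported lazily so the helpers work without it

    model = whisper.load_model("small")  # model size is an arbitrary choice
    result = model.transcribe(
        audio_path, language="nl", initial_prompt=to_initial_prompt(phrases)
    )
    return result["text"]


# Hypothetical example data: street names and a contact name shown in the GUI.
hints = build_bias_phrases(["Keizersgracht", " keizersgracht ", "Van  Woustraat", "Sjoerd"])
print(to_initial_prompt(hints))  # Keizersgracht, Van Woustraat, Sjoerd
```

The prompt only nudges the decoder toward these spellings; it is not the same thing as the phrase-level adaptation that Google offers, so the two would need to be compared on the same recordings.
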
Does anyone have advice, good experiences, or good reads on this subject?