So what would the right approach be to using the STT API to create a small tool like command line python:
Given that you might need the services of some spider code to enable using logins, handling cookies, etc, of a modern web site, then, when you finally have the textual web site, it might be nice to still pass this to, for instance, the chat completions API, to put the found text into a form that STT will read well – pausing only at points that make sense to pause at, doing the right thing with acronyms and personal nouns, etc.
Has anyone already made this? Comments on this initial sketch of how to do it?
The Speech API will do a pretty good job with just plain text, but you could build up all sorts of text to phonetics features to aid understanding, the web login and cookie handling would be a separate issue that you could handle with additional libraries and even get ChatGPT to assist with code creation.
I’d start with just a plain web reader without the login features just to test your idea out and then flesh that out from there.
You can ask ChatGPT for a web page reader code in python or Node.js and pretty much have a working app in a few hours of back and forth.
There is early work reported on llm → web code that can reliably manipulate web pages. Hooking that to STT would give you some ability to voice login, navigate, etc. Sorry, don’t have refs, but try searching llm web page manipulation or something like that. Pretty early stage I think.
thanks to both of you for those awesome responses so far. I’ll add back to this thread with whatever I get done on this
I guess you need to use a site like gptcrawler.thesamur.ai and feed the output to the LLM