How would you go about implementing this? Am I hallucinating?

inner-ikigai · November 22, 2024, 3:49pm

Hello Everyone

Context:
I’m a father and my kids have this unending thirst for knowledge / ask questions all day long. I want to build a companion app for kids where they can converse with the AI in a safe manner.

The vision is a simple iOS app where the user can have natural flowing conversations with the AI character (voice only, anywhere between 15-45 minutes at a time).

At first I came across advanced voice mode, but I realized there is no API for it. Then I saw Realtime API, but it looks like it’s just too expensive to make my app idea feasible/profitable (I’m planning to have a subscription model for around 10 bucks a month).

I’m a dev but new to the AI game, any ideas around a cost effective implementation would be super appreciated.

The only requirements are:

natural sounding voice & natural flow (as in you speak, they reply, you speak etc, without having to touch any buttons)
Can’t be overly expensive, as the plan is to make it a consumer app at some point

What would be the best way to accomplish the above in a cost effective way?

One of my friends mentioned this: user voice → speech to text service → send to open AI LLM → text to voice service → return to user ?

I think a little bit of a delay isn’t a huge issue, since it is an app for kids & I can make a cute lil animation of the character thinking for a second before replying etc.

I’ll continue doing my homework, but any help would be super appreciated

Thank you guys!!

anon10827405 · November 22, 2024, 4:37pm

You can use some of available technologies to bootstrap a functional product like LiveKit, Agora, Daily etc.

I would highly recommend using RealTime API. The latency difference is completely worth the price point. Not to mention that historically OpenAI has been capable of dramatically reducing costs of their models.

The trick for now is to micro-manage the conversation.

However, these services all offer the typical STT → LLM → TTS paradigm, which is perfectly functional for a user that is understanding of the flow.

For kids, however, there’s a lot of edge cases that will completely ruin the experience.

The functionalities that RealTime API brings provide something of a possibility to mitigate these edge cases, mostly with interruptions.

The STT → LLM → TTS do not have solutions. I have yet to see something that is capable of delivering quality responses while also managing the inevitable overlapping in voices without being overengineered and consequently bringing it even more edge cases.

scharleswatson · November 22, 2024, 8:53pm

Might be best to teach kids how to do effective research. If you just give them some LLM wrapper they will be “learning” from the output which can easily include wrong answers and hallucinations, and that is a slippery slope. The ability to use quality research methods and critically address information is far more important than some sort of half-baked “know everything” bot spitting out random nonsense, IMHO.

inner-ikigai · November 23, 2024, 8:03pm

Thank you for the reply Ronald, I took a look at LiveKit, Agora and Daily and their prices look incredibly low - I have a question about that to be honest.

Can you help me understand 1 thing. Are they really cheaper than just using straight up API’s from Open AI or Claude? and handling the TTS / STT myself? I can’t see how that’s possible, but it looks like it is very cheap.

I assume I am misunderstanding something in the pricing of the tools you mentioned

On a side note, I will consider using Realtime API for the benefits you mentioned, but at the same time I do want to investigate TTS/STT implementations because of the current cost situation.

anon10827405 · November 23, 2024, 8:05pm

Their prices do not include the agents that you attach to them. So you would actually need to add their cost WITH whatever agent you decide to use with it.

inner-ikigai · November 23, 2024, 8:19pm

Gotcha, so in general you agree it would be cheaper if I implemented the entire stack (stt → send to llm → parse return → tts) ?

I assume yes but just wanted to get your 2 cents - and thanks again!

anon10827405 · November 23, 2024, 8:27pm

Yes, 100%

but keep in mind that it’s not exactly the RealTime API that I’m saying is the ideal choice, it’s the concept of the framework that it’s built on.

If you build a system using this paradigm you are going to be stuck in the past and will eventually need to restructure everything.

In reality, especially when dealing with a target audience of kids you will be burdened with constant edge cases that cannot be solved using STT → LLM → TTS.

I can almost guarantee that anyone trying to rapidly prototype the above paradigm will find themselves trapped in production inferno, constantly chasing the dragon’s tail through over-engineering

inner-ikigai · November 23, 2024, 8:44pm

Make sense, thanks Ronald!

I totally see the benefits of Realtime API, and I am leaning towards using it (if I can make the pricing work).

Just to clarify, my question about pricing above was in regards to the other providers you mentioned such as Agora and Daily.

anon10827405 · November 23, 2024, 9:01pm

No problem.

You can rapidly prototype both options here:

Topic		Replies	Views
Introducing the Realtime API Announcements	28	8873	January 16, 2025
A conversation using the API API	6	3046	December 16, 2023
Need help on how to approach the API usage metric for user of the app API	16	1864	January 3, 2024
Building the Ultimate Chatbot: What Do You Think of My Strategy? API	30	6629	December 18, 2023
Economics of Public-Facing AI API	3	800	February 9, 2023

How would you go about implementing this? Am I hallucinating?

Related topics