Is RAG really only for factual recall?

I’ve built a daily comedy podcast where all content is generated by a mix of OpenAI and open-source models, with the final script as a JSON doc. It features multiple characters, each with their own prose style.

Currently this is done with few-shot prompting: passing a pseudo chat history with each request. That works pretty well but gets expensive. Since I now have a pile of generated scripts, I’m curious about setting up a round-robin whereby generated scripts are embedded and retrieved during subsequent runs.
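For what it’s worth, here’s a minimal sketch of that round-robin idea: past script lines get embedded and indexed per character, then pulled back as few-shot style exemplars. Everything here is hypothetical — `embed` is a toy stand-in for a real embedding call (e.g. an embeddings API), and the archive entries are made up.

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder: a real system would call an embedding model here.
    # This toy version just hashes characters into a small normalized vector.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) / 1000.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Index of previously generated script lines, tagged by character.
archive = [
    {"character": "A", "line": "Deadpan observation about the weather."},
    {"character": "B", "line": "Manic pun-laden rant about toast."},
    {"character": "A", "line": "Dry aside about municipal politics."},
]
for entry in archive:
    entry["vec"] = embed(entry["line"])

def retrieve_style_examples(character: str, topic: str, k: int = 2) -> list[str]:
    # Filter to the target character first, then rank by similarity to the
    # topic, so retrieval returns exemplars that are both on-style and on-topic.
    qvec = embed(topic)
    candidates = [e for e in archive if e["character"] == character]
    candidates.sort(key=lambda e: cosine(qvec, e["vec"]), reverse=True)
    return [e["line"] for e in candidates[:k]]

examples = retrieve_style_examples("A", "weather report")
```

The retrieved lines would then be prepended to the prompt in place of the fixed pseudo chat history, so you only pay for the exemplars most relevant to each run.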

All the RAG stuff I’ve seen is about topical recall, rather than style reference. But could it work for my purposes? I know there’s also fine-tuning, but I’m not ready for that yet.


RAG for assistants is shrouded in a bit of mystery, but here’s what Bing Chat has to say on the topic.

> Retrieval Augmented Generation (RAG) uses embeddings to encode the input query and the retrieved documents. The encoded vectors are then used to generate a response. Embeddings allow the model to understand the semantic similarity between different pieces of text, which is crucial for effective information retrieval and response generation.

So essentially it doesn’t read the document the way it would a normal prompt; it’s more like using Ctrl+F to find relevant passages within a document.
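To make the “semantic Ctrl+F” concrete, retrieval boils down to comparing embedding vectors with cosine similarity. The vectors below are toy 3-dimensional stand-ins (real embedding models return hundreds or thousands of dimensions), but the ranking logic is the same.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means the vectors point in the same direction (semantically close);
    # values near 0 mean the texts are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
doc_same_topic = [0.8, 0.2, 0.1]
doc_other_topic = [0.0, 0.1, 0.9]

# The on-topic document scores much higher, so it would be retrieved first.
assert cosine_similarity(query, doc_same_topic) > cosine_similarity(query, doc_other_topic)
```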

If you’re generating a significant number of responses, fine-tuning is the way to go for imitating a style. Otherwise, just refine your system prompt so the model writes in the style you want with the fewest tokens.

I think it is possible. Consider this scenario:

1. User: “tell a joke in the style of persona A.”
2. Call the API with function calling; the model returns `get_style { style: "persona A", prompt: "tell a joke" }`.
3. Run an embedding search to get persona A’s description, or past conversation history of them cracking jokes.
4. Append the result to the system prompt with an instruction to write in that style.
5. Call the API again with the updated system prompt (maybe without function calling), using either the original prompt or just “tell a joke”.
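The steps above could be wired together roughly like this. It’s a hedged sketch of the control flow only: `chat` is a stub standing in for a real chat-completions call, and `get_style` does a simple keyed lookup where a real system would run an embedding search.

```python
import json

# Pretend store of style material that an embedding search would return.
STYLE_STORE = {
    "persona A": "Persona A tells dry, deadpan one-liners with long pauses.",
}

def chat(system_prompt: str, user_prompt: str, tools=None) -> dict:
    # Placeholder for the real API call. First pass: simulate the model
    # choosing the get_style tool; second pass: simulate a styled reply.
    if tools:
        return {"tool_call": {
            "name": "get_style",
            "arguments": json.dumps({"style": "persona A",
                                     "prompt": "tell a joke"}),
        }}
    return {"content": f"[styled reply written under: {system_prompt[:50]}...]"}

def get_style(style: str) -> str:
    # In a real system: embed `style`, search the vector store, return top hits.
    return STYLE_STORE.get(style, "")

# Steps 1-2: first call; the model decides to fetch style material.
first = chat("You are a comedy writer.",
             "tell a joke in the style of persona A",
             tools=["get_style"])
args = json.loads(first["tool_call"]["arguments"])

# Steps 3-4: retrieve the style material and fold it into the system prompt.
style_text = get_style(args["style"])
system = f"You are a comedy writer. Imitate this voice:\n{style_text}"

# Step 5: second call with the updated system prompt, no tools this time.
final = chat(system, args["prompt"])
```

The key design point is the two-pass structure: the first call only decides *what* style material to fetch, and the second call does the actual writing with that material injected into the system prompt.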