How to Fine-Tune without Fine-Tuning -- Or, How to Make your RAG Implementation Smarter

Deepening Comprehension through Complementary Content

Imagine this. An alien lands on earth and asks you to explain America to it. You give it the Declaration of Independence, US Constitution and Bill of Rights.

That gives it a legal knowledge of the country and its history. But, to give the alien a fuller picture, you’d need to give it:

  • The Federalist Papers
  • The Gettysburg Address
  • John F. Kennedy’s “We do it because it’s hard…” speech
  • Martin Luther King’s “I Have A Dream” speech

It’s a weird analogy, I know, but hopefully you get the point.

I have built one regulatory knowledge base, and am working on two more, using the RAG (Retrieval-Augmented Generation, a.k.a. “Embeddings”) approach. The first application I developed was a database of California real estate law. For all intents and purposes, the OpenAI models we use know everything there is to know about CA real estate law that is actually in the documents they have access to. Which is great.

However, whatever terms, phrases, contexts, and inferred meanings are NOT explicitly found in these documents, our models do NOT know. To use the analogy of “Data” from Star Trek: he knows everything there is to know about Starfleet regulations, command procedure, starship technology, the known universe… But he lacks human emotion. The AI models not only lack human emotion, they lack human intuition and a sense of the world outside of the documents they have been given.

So, the question becomes: how do we make the models smarter, more “aware,” and more capable of answering questions which, while not explicitly addressed in their supplied documents, may nonetheless be “implicit” in the meaning of the documentation?

The first thing that comes to mind is “fine-tuning.” I have not personally tried fine-tuning an OpenAI model, so I can’t say for certain how effective it would be for making a model “smarter” in the sense I’ve defined. However, from what I’ve seen so far on the OpenAI Developer Forum, there don’t seem to be many glowing success stories about using fine-tuning to expand a model’s contextual understanding.

As I currently understand it, fine-tuning is more focused on influencing the stylistic consistency of a model’s responses, rather than expanding its conceptual knowledge or ability to make contextual inferences. The fine-tuning process allows you to nudge the model toward a certain tone or stylistic approach, but may not help it infer ideas and meanings that aren’t directly contained in its training data.

So, how to make a model more capable of answering a wider range of questions without excessive prompt gymnastics, temperature settings, or opening it up to the Internet (thus inviting greater chances of “hallucination”)?

The answer I’ve found is through a method I call “Deepening Comprehension through Complementary Content”.

The key to expanding an AI model’s understanding without resorting to excessive prompting or opening it up to the entire internet is providing complementary content that deepens its comprehension of the core subject matter.

For my real estate law application, the base document set consisted of the main statutes, regulations, and case law. This gave the model a solid legal foundation. However, there were still gaps in its ability to make inferences and contextual connections.

To fill those gaps, I identified additional content that complements the core documents - things like definition dictionaries, blog posts analyzing issues, case summaries, FAQs, and more. These materials discuss the law in greater detail, from different perspectives, providing context and connections that aren’t spelled out in the regulations themselves.

Another thing I did was use the AI to create a “Fact Sheet” for it to draw on. That is, I created a list of law facts, or “answers” if you will, and augmented each one with multiple questions (usually at least 10) that it answers. When this Fact Sheet is added to the embeddings, it expands the range of user questions likely to score a high cosine similarity with that “answer”, and so retrieve it.
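To make the Fact Sheet idea concrete, here is a minimal sketch, not my production code. It assumes each fact is stored as one chunk containing its questions followed by the answer, and it uses a toy bag-of-words vector as a stand-in for a real embedding model so that it runs with no API key. The names `embed`, `cosine`, and `fact_sheet_entry` are hypothetical, the legal wording is placeholder text, and only three questions are shown for brevity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.
    ASSUMPTION: a real system would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fact_sheet_entry(answer, questions):
    """Build one Fact Sheet chunk: the questions first, then the answer."""
    lines = [f"Q: {q}" for q in questions]
    lines.append(f"A: {answer}")
    return "\n".join(lines)


# Placeholder fact and questions, for illustration only (not legal advice).
answer = ("The landlord must return the security deposit within the "
          "statutory deadline after move-out.")
questions = [
    "When do I get my security deposit back?",
    "How long can the owner keep my money after I leave?",
    "What is the deadline for refunding a deposit?",
]
entry = fact_sheet_entry(answer, questions)

# A lay-person query that shares almost no wording with the bare answer.
query = "How long can the owner keep my money?"
score_answer_only = cosine(embed(query), embed(answer))
score_fact_sheet = cosine(embed(query), embed(entry))

print(f"answer only:      {score_answer_only:.3f}")
print(f"fact sheet entry: {score_fact_sheet:.3f}")
```

The query scores higher against the Fact Sheet entry than against the bare answer, because the entry carries the everyday phrasing a user is likely to type. With a real embedding model the effect comes from semantic closeness rather than shared words, but the principle is the same. Whether to embed all the questions in one chunk, as here, or each question separately is a design choice worth testing on your own data.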

By ingesting these auxiliary and explanatory materials, the model gains a deeper implicit understanding of the core concepts and how they relate to each other. This allows it to better handle questions and scenarios that go beyond just the explicit facts found within the base documents.

In essence, complementary content enhances comprehension. It gives the model more conceptual background so it can develop a nuanced, inferential grasp of the subject matter. This approach has proven far more effective than just prompting, temperature tweaking, or opening the model up to the internet. The key is curating quality supplementary sources that expand understanding of the core knowledge base.

One of the things I have noted on the OpenAI Developer Forum is how, after the initial love affair, more and more people have become disenchanted with the typical results from “Chat with your PDF” implementations. It’s not that these implementations are doing anything wrong – they are doing exactly what they advertise: allowing you to chat with your PDFs. But, in many instances, especially in the case of contracts, regulations, laws, statutes and codes, you need the models to be more intuitive (without being overly creative). Less like HAL 9000, Data or Isaac and more like Mr. Spock.

At the end of the day, AI models are only as smart as the data we give them. While modern large language models can ingest and comprehend massive amounts of text, they still have limitations in making inferences and nuanced connections from sparse or strictly factual data. By judiciously supplementing core documents with explanatory materials like summaries, analyses, and FAQs, we can expand the model’s implicit understanding and allow it to handle a greater range of contextual questions. While not a magic bullet, curating complementary content to deepen comprehension has proven an effective method for getting more out of AI applications without resorting to excessive tuning or opening Pandora’s box to the entire internet. With the right balance of foundational knowledge and elucidating resources, we can craft models that are handy, helpful, and hopefully a bit more human.

What I wrote, the way I wrote it, was designed for the average lay person coming up on the technology. But I have to admit, this version is eminently more appropriate for this forum. I think I’ll be AI-ing a lot more of my stuff! Thank you!

Please don’t. Anyone can generate the AI version of your work for themselves. I do not want the forum to fill up with AI versions of existing work.