RAG input via System message: JSON vs plain text

I’m working with gpt-4o and gpt-4o-mini in a RAG application. Is it OK to pass the RAG data in JSON format via the System message?

{"URL": "https://website.com/page1", "ProductName": "MyProduct", "ContentType": "Documentation", "Content": "This page describes how to ..."}

###############

{"URL": "https://website.com/page2", "ProductName": "MyProduct2", "ContentType": "Video", "Content": "This video tutorial describes how to ..."}

Or is it much better to use plain text inputs?

URL: https://website.com/page1
ProductName: MyProduct
ContentType: Documentation
Content: This page describes how to ...

###############

URL: https://website.com/page2
ProductName: MyProduct2
ContentType: Video
Content: This video tutorial describes how to ...
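
For reference, this is roughly how the chunks end up in the request on my side (a simplified sketch; the instruction line and the helper name are placeholders, not my exact code):

```python
from openai import OpenAI

client = OpenAI()

# Example chunks as returned by my vector search (same fields as above)
chunks = [
    {"URL": "https://website.com/page1", "ProductName": "MyProduct",
     "ContentType": "Documentation", "Content": "This page describes how to ..."},
    {"URL": "https://website.com/page2", "ProductName": "MyProduct2",
     "ContentType": "Video", "Content": "This video tutorial describes how to ..."},
]

def build_system_prompt(chunks):
    # Plain-text variant; use json.dumps(c) per chunk for the JSON variant instead
    blocks = "\n\n###############\n\n".join(
        f"URL: {c['URL']}\nProductName: {c['ProductName']}\n"
        f"ContentType: {c['ContentType']}\nContent: {c['Content']}"
        for c in chunks
    )
    return "Answer using only the documentation chunks below.\n\n" + blocks

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": build_system_prompt(chunks)},
        {"role": "user", "content": "How do I configure MyProduct?"},
    ],
)
print(response.choices[0].message.content)
```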

According to my test:

  • gpt-4o understands that different chunks correspond to different products and answers based only on the chunks for the product in question. This holds with both JSON and plain text.
  • gpt-4o-mini doesn’t respect the ProductName attribute and answers based on all chunks (information from other products gets applied to the product in question). It fails this way with both JSON and plain text.

So, the format doesn’t make any difference?

1 Like

No, this is not how you’d normally do it and would not be in line with the docs.

RAG data is usually returned via the role “tool”, as a direct response to a tool call, not via a system prompt.

You can put some data in the system prompt, e.g. current date and time, or a small table of constants that is used very often, but this would not be data acquired in a “RAG” process of function call and answer.
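
A minimal sketch of that round trip with the Chat Completions API (the function name and the search stub are purely illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

def my_vector_search(query: str) -> list[dict]:
    """Placeholder for your own embedding search."""
    return [{"URL": "https://website.com/page1", "Content": "This page describes how to ..."}]

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",  # illustrative name
        "description": "Search the product documentation for relevant chunks.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How do I configure MyProduct?"}]

first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]  # assuming the model decided to call the tool

# the RAG data goes back with role "tool", tied to the tool call id
chunks = my_vector_search(json.loads(call.function.arguments)["query"])
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(chunks)})

second = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(second.choices[0].message.content)
```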

1 Like

Thanks for your response.

  1. Why is it recommended to make an extra round-trip with function calling if we already know the function needs to be called and its parameter (the user question) is known from the start?
  2. Could you share the docs URL to read more about this point?
1 Like

I think @merefield is talking about an agentic approach (à la Assistants),

but you seem to be pre-fetching documents the good ol’ way.

notes on the system prompt

fundamentally there’s no real difference between a user, system, or assistant message, although OpenAI likes to pretend like there is (for “safety” reasons). Overall, your entire conversation is just a document.

What the “system prompt” may or may not do is shape the model’s focus. If your system prompt gets super long, the bottom part of the system prompt likely doesn’t benefit much from the fact that it’s in a system prompt.

notes on the format

Regarding format: some folks seem to recommend structured markdown (mostly headers, I think, are what matters); you can try whether that works better with mini. But with stronger models, I don’t think it matters all that much.

notes on “Is it OK”

Everything’s OK as long as it works :laughing:

Most of these structures are just artificial anyways. Underneath it all is just a completion model that doesn’t really care about these ‘rules’, but might be trained to behave in specific ways. How it actually acts can deviate from what OpenAI intended, so there aren’t any real hard guidelines here.

3 Likes

How are you achieving RAG if not via Function Calling?

How does the system know what subset of data needs to be injected into the prompt (regardless of where it is injected)?

embedding search? :thinking:

1 Like

Initiated by what though? …

The user’s message, or a chunk of the conversation for example

BTW, modern embedding models are even instructable

e.g.: NV-Embed-v2 (not openai)

task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

# queries are prefixed with an instruction describing the retrieval task
query_prefix = "Instruct: " + task_name_to_instruct["example"] + "\nQuery: "
queries = [
    'are judo throws allowed in wrestling?',
    'how to become a radiology technician in michigan?'
]
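
and the actual embedding call then follows the model card pattern, roughly like this (the encode helper is custom code shipped with the checkpoint via trust_remote_code, so double-check the card for the exact signature):

```python
import torch.nn.functional as F
from transformers import AutoModel

# example passages to rank against the queries above
passages = [
    "Judo throws are generally not allowed in folkstyle wrestling ...",
    "To become a radiology technician in Michigan, you need ...",
]

model = AutoModel.from_pretrained("nvidia/NV-Embed-v2", trust_remote_code=True)

# queries get the instruction prefix, passages don't (per the model card)
query_embeddings = model.encode(queries, instruction=query_prefix, max_length=32768)
passage_embeddings = model.encode(passages, instruction="", max_length=32768)

# cosine similarity after L2 normalization
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
scores = query_embeddings @ passage_embeddings.T
```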

nvidia/NV-Embed-v2 · Hugging Face

pretty cool stuff

You’ve lost me.

In order to initiate an embedding search, the architecture will need to pick up the user’s intent.

With function calling, that can be integrated with the client’s code.

I do not understand how you are proposing the embedding search is kicked off without the LLM having determined it needs to call out to a function.

Well, you could use embeddings to determine the user’s intent.

But it can be even simpler than that: you just do the search anyway, whether it’s necessary or not, load potentially useful information into the prompt, and let the LLM sort through it if it wants to.

The simplest example here is a FAQ bot: the chance that a user will ask a FAQ is relatively high, so why not just fetch the relevant FAQs up front instead of waiting for the model to call a function that would do the same thing anyway.

If the retrieved information is not relevant, the model will generally just ignore it. You may lose a couple of input tokens here and there, but the system will be much faster and more responsive, and it saves you the function-call output tokens (and the input tokens, if you do a full round trip).
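
As a sketch, the “always retrieve” version is just this (the helper name and prompt wording are made up):

```python
from openai import OpenAI

client = OpenAI()

def faq_search(query: str, top_k: int = 3) -> list[str]:
    """Placeholder for your embedding search over the FAQ entries."""
    return ["Q: How do I reset my password?\nA: Settings > Account > Reset password."]

def answer(user_message: str, history: list[dict]) -> str:
    # retrieve on every turn, whether the question needs it or not
    faqs = faq_search(user_message)
    system = (
        "You may use these FAQ entries if they are relevant, otherwise ignore them:\n\n"
        + "\n\n".join(faqs)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  *history,
                  {"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```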

2 Likes

you could … but given that using the LLM for function calling is now so “cheap” and fast … wouldn’t this be unnecessarily basic?

I suppose it’s a matter of taste. :person_shrugging:

I personally still consider functions to be slow, brittle, wasteful (in terms of attention, as well as inference), and expensive (in terms of development)

But I understand that YMMV depending on your experience, comfort, task, and tooling.

I might be a fundamentalist :laughing:

1 Like

I appreciate that a more basic approach might remove some potential fragility :+1:

1 Like

@Diet Thank you for the detailed response.

The context window of gpt-4o and gpt-4o-mini is 128k tokens. I add 12k tokens of documentation obtained with vector search to the system prompt. As I understand it, the problem could be with information in the middle of the context (LLMs analyze the leading and trailing edges of the context better), but my 12k tokens sit at the leading edge relative to the whole context size.
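
(For what it’s worth, I measure that 12k roughly like this, assuming a tiktoken version that already knows gpt-4o; otherwise fall back to the o200k_base encoding. The file name is just a placeholder for the assembled chunks.)

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # or tiktoken.get_encoding("o200k_base")

with open("rag_context.txt") as f:           # placeholder file holding the assembled chunks
    print(len(enc.encode(f.read())))         # ~12,000 tokens in my case
```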

I have just tried Markdown:

## URL

https://website.com/page1

## Product Name

MyProduct

## Content Type

Documentation

## Content

This page describes how to ...

###############

## URL

https://website.com/page2

## Product Name

MyProduct2

## Content Type

Video

## Content

This video tutorial describes how to ...

The result is the same: gpt-4o works, gpt-4o-mini fails :frowning:

I agree :laughing:

1 Like

@merefield The chatbot answers user questions using documentation/blog/website knowledge (250 products, ~50k pages scraped by a crawler). The LLM must be given the related information to answer each question.

I was gonna get into this but deleted it from my post lol

Location of RAG context within system prompt - #2 by Diet

My concern (with my current understanding) would rather be that attention is unduly influenced by position, so I’d rather push all the knowledge into the red part so that everything has an equal chance of bubbling to the forefront. The model will generally still find the information if it appears relevant enough. However, it probably won’t make all that much difference.

That seems pretty confusing at a glance; I’d try to structure it in a more readable way:

# Myproduct2 Video

Myproduct2: This video tutorial describes how to ...

Link: [Myproduct2 Video URL](https://website.com/page2)
------

I can’t guarantee this will work with mini, but this is my general approach:

[Image: an illustration of a monkey with an exposed brain, captioned “Neuron activation.” (Captioned by AI)]

In long-context problems (12k is long context, imo), try to construct text blocks that generate a maximal, specific activation in a certain area, and try to make them as semantically dissimilar from the other blocks as possible. Then you want to make sure that that particular block generates a maximal activation based on your immediate context (the tail of the generation).

That’s how Chain of Thought or “Think Step By Step” works:

If the issue is that a user asks “what’s the URL for product 2?” and the model responds with just any URL, it might be a good idea to get the model to first write out a short summary of product 2:

(the model should be able to find the short description of product 2)

Myproduct2 is a video that describes how to …

that then increases the activation of the whole product 2 description block (or, more importantly, decreases the attention directed towards unrelated blocks),

and once you have the attention on the entire block, you can retrieve the URL.

I’ll admit this is a bit contrived, but the idea is that instead of fetching ambiguous data, you take some time to amplify a concept before digging through your deeper context.

This can take some time to get right with a weak model. You generally end up trading inference cost for development cost :confused:
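
Concretely, you could force that amplification step with something like this in the system prompt (the wording is just an example, not a recipe):

```python
documentation_chunks = "..."  # the retrieved chunks, formatted as in the earlier posts

system_prompt = (
    "You answer questions about our products using the documentation below.\n"
    "Step 1: state which product the question is about and quote its one-line description.\n"
    "Step 2: answer using only that product's chunks; ignore chunks for other products.\n\n"
    + documentation_chunks
)
```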

Other stuff you can of course do:

  1. try to improve your data, to remove uninformative boilerplate and fluff
  2. try to pre-select stuff before giving it to the conversational LLM (present fewer options; rough sketch below)
  3. use a better embedding model to return fewer, but more relevant results
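
For point 2, the pre-selection can be as dumb as a filter on the metadata you already have (the field names match the chunks earlier in the thread; `score` is whatever your vector search returns):

```python
def preselect(chunks: list[dict], product_hint: str | None, top_k: int = 5) -> list[dict]:
    """Cheap pre-selection before the conversational call (illustrative heuristic)."""
    if product_hint:
        # keep only chunks for the product the user seems to be asking about
        chunks = [c for c in chunks if c["ProductName"].lower() == product_hint.lower()]
    # then keep only the best-scoring few
    return sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)[:top_k]
```
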
1 Like

12k system prompt + 5k conversation (summarized if it exceeds 5k) < 20k

So all my data has the highest score (5).

Looks like it doesn’t matter where to place the prompt in my case.

Unfortunately, this doesn’t help (gpt-4o works, gpt-4o-mini fails). Conclusion: the format (JSON, Markdown, plain text) doesn’t matter, but the model does.

Just thoughts:

  1. the crawler scrapes only the particular tags containing the main content of the docs/blog/website pages and skips the header/footer/navigation/sidebar, so the data is “clean”
  2. there are a lot of products, and users may not know the correct name or may misspell it. They might also be looking for a feature in one product while the chatbot should recommend another product where that feature is available
  3. text-embedding-3-large is used for generating embeddings. If OpenAI releases a better one, we will use that
1 Like

yes, that’s obvious.

the point is that you’d normally (in our new modern world) use Function Calling to achieve that.

How else are you invoking the search?

I work with the Chat Completions API. As the LLM is stateless, the conversation is stored in my database. For each API call, the chatbot retrieves all messages from the current session plus the result of a vector search for the latest user message.

Why not the Assistants API? With the Chat Completions API, I have more control over the conversation:

  • do the vector search on my side (I can debug and check its results anytime) instead of using Assistants’ File Search
  • decide when and how to summarize the conversation
  • pre-process the user message (an additional API call) to get a RAG search phrase based on the latest message while keeping the context of the previous messages (roughly as sketched below)
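
The whole loop looks roughly like this (the storage and search helpers are placeholders for my own code):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder stubs for my own storage and search code
def load_messages(session_id): return []              # session history from my database
def save_messages(session_id, *messages): pass        # persist the new turn
def vector_search(query, top_k=10): return []         # embedding search over the crawled pages
def build_system_prompt(chunks): return "Documentation:\n\n" + "\n\n".join(map(str, chunks))

def handle_turn(session_id: str, user_message: str) -> str:
    history = load_messages(session_id)  # already summarized if it exceeded ~5k tokens

    # extra pre-processing call: turn the latest message into a standalone search phrase
    search_phrase = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's last message as a standalone search query, "
                        "using the earlier messages for context."},
            *history,
            {"role": "user", "content": user_message},
        ],
    ).choices[0].message.content

    chunks = vector_search(search_phrase)

    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": build_system_prompt(chunks)},
            *history,
            {"role": "user", "content": user_message},
        ],
    ).choices[0].message.content

    save_messages(session_id, user_message, answer)
    return answer
```
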
2 Likes