How to format context documents so the model can recognize specific fields within them

I'm working with the gpt-3.5-turbo API. I build the text of my context documents like this:

// Construct the context document string with labeled elements
$documentString = "Document Title: {$documentTitle}. ";
$documentString .= "Content: {$contextDocument}. ";
$documentString .= "Event Date: {$documentDate}. ";
$documentString .= "Document Groups: {$documentGroups}. ";
$documentString .= "Document Taxonomy/Tags: {$documentTaxonomy}.";
$documentString .= "Source document summary: {$documentSummary}.";
$documentString .= "URL: {$documentURL}.";

So, the full document context submitted in the prompt will look something like this:

Question: What documents are available with the title “Drupal AI SolrAI - CSS”? Number of context documents: 2

Context document 1: Document Title: Drupal AI SolrAI - CSS. Content: blah, blah… . Event Date: . Document Groups: Documentation, Development, D9 Development, AI. Document Taxonomy/Tags: .Source document summary: blah, blah…URL: https://master1and1-9.schoolboard.net/node/1445. -----

Context document 2: Document Title: AI: Drupal SolrAI Module. Content: blah, blah… Event Date: . Document Groups: Documentation, Development, D9 Development, AI. Document Taxonomy/Tags: AI ToDo.Source document summary: blah, blah…URL: https://master1and1-9.schoolboard.net/node/1402. -----

I use “-----” as the context document separator. The model seems to understand this.

I am trying to generate a chat completion where I ask the model to recognize the “Title” field in the context documents submitted to it. Currently, with the above format, I get:

I’m sorry, but I couldn’t find any documents with the exact title “Drupal AI SolrAI - CSS” in the provided context documents. Please try rephrasing your question or consider using a keyword search to find the desired documents.

Yet there is clearly a document titled “Drupal AI SolrAI - CSS” in the context documents supplied above.

I need a better way to separate the fields in the context text. Any suggestions for making it easier for gpt-3.5-turbo to recognize these fields?

BTW, gpt-4 doesn’t have a problem recognizing the title field.

Well, what do you know? I posed the question to GPT-4.

Formatting the input in a structured manner can help the model better understand the data it’s processing. You can try adding extra line breaks and colon characters to clearly separate the fields. Here’s an example:

Question: What documents are available with the title “Drupal AI SolrAI - CSS”? Number of context documents: 2

-----
Context document 1:
Document Title: Drupal AI SolrAI - CSS
Content: blah, blah…
Event Date: 
Document Groups: Documentation, Development, D9 Development, AI
Document Taxonomy/Tags: 
Source Document Summary: blah, blah…
URL: https://master1and1-9.schoolboard.net/node/1445
-----

Context document 2:
Document Title: AI: Drupal SolrAI Module
Content: blah, blah…
Event Date: 
Document Groups: Documentation, Development, D9 Development, AI
Document Taxonomy/Tags: AI ToDo
Source Document Summary: blah, blah…
URL: https://master1and1-9.schoolboard.net/node/1402
-----

Each field in the context document is now on its own line, which could make it easier for the model to identify them. The use of “-----” to separate the documents remains the same.
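For anyone building the string in PHP as in the original post, here is a minimal sketch of the one-field-per-line layout. The function name formatContextDocument and the array-based interface are my own assumptions for illustration, not part of the original code; the sample values are the ones from this thread.

```php
<?php
// Hypothetical helper (not from the original post): renders one context
// document with each labeled field on its own line, then the "-----" separator.
function formatContextDocument(int $number, array $fields): string
{
    $lines = ["Context document {$number}:"];
    foreach ($fields as $label => $value) {
        // rtrim() drops the trailing space when a field is empty.
        $lines[] = rtrim("{$label}: " . trim((string) $value));
    }
    $lines[] = '-----';
    return implode("\n", $lines) . "\n";
}

// Sample values taken from the thread above.
$documentString = formatContextDocument(1, [
    'Document Title'          => 'Drupal AI SolrAI - CSS',
    'Content'                 => 'blah, blah…',
    'Event Date'              => '',
    'Document Groups'         => 'Documentation, Development, D9 Development, AI',
    'Document Taxonomy/Tags'  => '',
    'Source Document Summary' => 'blah, blah…',
    'URL'                     => 'https://master1and1-9.schoolboard.net/node/1445',
]);

echo $documentString;
```

Keeping the labels in an array means the field order and wording live in one place, so it is easy to experiment with different labels if the model still misses a field.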

I made that change, and now this is what I get:

It’s not perfect, but WAY better than what I was getting before. Thank you GPT-4! Again!

Thanks for your reply in the other thread! Very clear.
As a beginner in AI, this is really helpful.

Do you think you could have used the same technique as in your video “GPT-4 & LangChain Tutorial: How to Chat With A 56-Page PDF Document (w/Pinecone)”?

Using the same technique, let’s say each “context document” is a chunk.
Would it be possible to answer queries like “What documents are in the Development group?” or “What documents have the ToDo tag but not the AI tag?”

My fear is that the vector comparison step will return too many chunks, and too many relevant docs will be sent to ChatGPT.

You have to sum the number of context tokens and limit it. There are some OpenAI examples that use tiktoken for this.
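A rough sketch of that budgeting step in PHP, in case it helps. Note the assumptions: tiktoken is a Python library, so this sketch swaps in a crude estimate of about 4 characters per token for English text, and the names estimateTokens and limitContextDocuments are hypothetical. Pass a real tokenizer as the third argument if you have one.

```php
<?php
// ASSUMPTION: roughly 4 characters per token for English text. This is a
// heuristic only, not tiktoken's exact count. strlen() counts bytes.
function estimateTokens(string $text): int
{
    return (int) ceil(strlen($text) / 4);
}

// Keeps documents in the order given (for example, ranked by relevance)
// and stops before the one that would push the total over the budget.
function limitContextDocuments(array $documents, int $maxTokens, ?callable $countTokens = null): array
{
    $countTokens = $countTokens ?? 'estimateTokens';
    $kept = [];
    $used = 0;
    foreach ($documents as $document) {
        $tokens = $countTokens($document);
        if ($used + $tokens > $maxTokens) {
            break;
        }
        $kept[] = $document;
        $used += $tokens;
    }
    return $kept;
}

// Three 400-character documents are estimated at 100 tokens each.
$documents = [str_repeat('a', 400), str_repeat('b', 400), str_repeat('c', 400)];
$kept = limitContextDocuments($documents, 250);

echo count($kept) . ' of ' . count($documents) . " documents fit\n";
```

With a budget of 250 estimated tokens, the first two documents fit and the third is dropped. Remember to leave room in the budget for the question, the instructions, and the model's answer.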

Unknown. You would have to test it out with your own documents. In my case, I am using Weaviate as my vector store, and I have a property named “groups” in the metadata that is also vectorized. So far, “Show me documents in the public group.” fails to list documents whose groups property contains or equals “Public”.

Your case could be different.

This appears to be the more complete solution, at least in the case I tested with gpt-3.5-turbo-16k: API Prompt for gpt-3.5-turbo-16k - #12 by SomebodySysop

Until I started sending the documents in XML format, gpt-3.5-turbo-16k would pretty consistently fail to recognize them if a) there were many of them or b) the text was fairly complex. No such problems with gpt-4 or gpt-4-turbo.
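As a rough illustration of what an XML-style context document could look like when built in PHP: the tag names, the index attribute, and the function name formatContextDocumentXml below are assumptions made for this sketch, not the exact format from the linked post.

```php
<?php
// Hypothetical helper: wraps one context document in XML-style tags.
// Tag names are illustrative assumptions, not taken from the linked post.
function formatContextDocumentXml(int $number, array $fields): string
{
    $xml = "<document index=\"{$number}\">\n";
    foreach ($fields as $tag => $value) {
        // Escape &, < and > so field text cannot break the markup.
        $escaped = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= "  <{$tag}>{$escaped}</{$tag}>\n";
    }
    return $xml . "</document>\n";
}

// Sample values taken from the second context document in this thread.
$xml = formatContextDocumentXml(2, [
    'title'   => 'AI: Drupal SolrAI Module',
    'content' => 'blah, blah…',
    'date'    => '',
    'groups'  => 'Documentation, Development, D9 Development, AI',
    'tags'    => 'AI ToDo',
    'summary' => 'blah, blah…',
    'url'     => 'https://master1and1-9.schoolboard.net/node/1402',
]);

echo $xml;
```

Explicit opening and closing tags give every field an unambiguous start and end, which may be why this holds up better than punctuation-separated fields when there are many documents or the text is complex.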