I am using GPT-4 Turbo to extract information from legal contracts. The input contract is in Japanese and is about 23k tokens.
My issue is that with the same prompt and the same input contract, the extraction results differ between runs. I set temperature = 0, and even when I set top_p = 0.1, the results are still different.
You can set top_p: 0.000001, but you will still get variation in the highly uncertain results of this task, because the AI model itself is not deterministic.
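For reference, this is the kind of request being discussed, as a minimal sketch assuming the openai Python SDK (v1); the model name, file path, and system prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

contract_text = open("contract.txt", encoding="utf-8").read()  # placeholder path

response = client.chat.completions.create(
    model="gpt-4-turbo",          # placeholder; use whichever model you call
    temperature=0,                # greedy-leaning sampling
    top_p=0.000001,               # shrink the nucleus to (almost) one token
    messages=[
        {"role": "system", "content": "Extract the key contract fields as JSON."},
        {"role": "user", "content": contract_text},  # the ~23k-token document
    ],
)
print(response.choices[0].message.content)
```

Even with both parameters pinned down like this, outputs can differ between runs.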
If you ask for reproduction of a section of a document like yours, consider: the model has some 20,000 token positions where it might start reproducing, and it uses attention masking, so it cannot value or perceive all of them at once.
Get the logprobs, and you will likely see that at the point where the extraction starts, there are many candidate positions from the document, all with low certainty.
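A minimal sketch of pulling those logprobs, again assuming the openai Python SDK; the prompt is a placeholder, and the loop just prints the competing candidates for the first few generated tokens:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0,
    logprobs=True,
    top_logprobs=5,  # up to 5 alternatives per generated token
    messages=[{"role": "user", "content": "Extract the parties from: ..."}],
)

# Several document positions carrying similar low probabilities at the
# start of the extraction is the signature of the run-to-run variation
# described above.
for token_info in response.choices[0].logprobs.content[:5]:
    alts = ", ".join(
        f"{alt.token!r}: {alt.logprob:.2f}" for alt in token_info.top_logprobs
    )
    print(f"chosen {token_info.token!r} | alternatives: {alts}")
```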
You may want to reduce the input you provide: use an embeddings model with a vector database to automatically retrieve relevant chunks of the document, or give the model a search function that can operate on the document. You could even add a table of contents so that function has bookmarks to jump to. If you can inject the same reduced set of 1,000 tokens of document each time, you will have an AI that is more reliable. A sketch of the embeddings approach follows.
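A minimal retrieval sketch, assuming the openai Python SDK, numpy, and a placeholder contract file; the character-based splitter, chunk sizes, and the example query are illustrative only:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
contract_text = open("contract.txt", encoding="utf-8").read()  # placeholder path

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into one matrix of vectors."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Naive character-based splitter with overlap; a real pipeline would
# split on the articles/clauses of the contract instead.
chunk_size, overlap = 800, 200
chunks = [
    contract_text[i : i + chunk_size]
    for i in range(0, len(contract_text), chunk_size - overlap)
]
chunk_vectors = embed(chunks)  # embed the document once and reuse

def top_chunks(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed([query])[0]
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# Inject the same small, relevant context each run instead of ~23k tokens.
context = "\n---\n".join(top_chunks("契約期間と解約条件"))  # "term and termination"
```

Embedding the chunks once and reusing the vectors also keeps the document side of the search stable between runs; only the query embedding is recomputed.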
The ultimate judgement for you to perform: is the AI correct or incorrect?
OpenAI embeddings models are also non-deterministic: the top results of a semantic search can swap positions between runs. But then you are choosing among, say, 100 chunks instead of 20,000 tokens of uncertainty. OpenAI is also not the only provider of embeddings.
Ultimately it is about quality, not necessarily producing the same thing every time. Be aware that the user's input alone may not focus retrieval on the correct RAG result as well as passing the entire document into the model context would.
Using embeddings to reduce the input you send will also reduce your language model costs.