What's better for the type of chatbot I am building: fine-tuning or embeddings?

This is the project.

I'm working on making a complex document interactive. It uses legal terminology and also defines some words or phrases in ways that can differ from common usage. The other tricky thing is that it is split into parts, and these parts affect other parts of the document. A part that affects another part says so, but the affected part doesn't reference the part that affects it.

I want people to be able to ask questions or advice and get an answer based on the document itself in the language and way I need it to respond in.

I want to further be able to use this afterwards, or at least have it be able to be used, to automate some tasks for me.

Should I reword everything and try to embed it as vectors, where some sections might be 6,000 tokens long, or should I fine-tune a model and then template it across other similar documents, just adjusting the responses?

I'm really stuck on this. I know cost is a consideration, but I'm trying to work out the best approach for accuracy first; I can justify the costs afterwards.



I don't think fine-tuning is the solution here; this sounds like a task for embeddings, perhaps with metadata and overlapping included. If you have a machine-readable list of which parts affect which others, you could build a meta-linking list to append to each embedded segment. You can also overlap 30% of the prior text with 40% new content, then append another 30% of the next segment. In essence you reduce each segment's new content by 60%, which is then used to create a bridge between what came before and what comes after. If you do that along with metadata about what affects what, you could have quite a comprehensive database to interrogate.
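A rough sketch of that overlap scheme in Python. The 30/40/30 split and the record fields are just my example; you'd tune the chunk size to your embedding model's limits:

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into windows where each chunk carries roughly 30% of the
    preceding text, 40% fresh content, and 30% of the following text."""
    new = int(chunk_size * 0.4)   # fresh content per chunk
    side = int(chunk_size * 0.3)  # overlap taken from each neighbour
    chunks = []
    for start in range(0, len(text), new):
        lo = max(0, start - side)
        hi = min(len(text), start + new + side)
        chunks.append(text[lo:hi])
    return chunks


def attach_links(chunks: list[str], affects: dict[int, list[int]]) -> list[dict]:
    """Pair each chunk with the cross-reference metadata so the
    'which part affects which' list travels with every embedded segment."""
    return [
        {"id": i, "text": c, "affects": affects.get(i, [])}
        for i, c in enumerate(chunks)
    ]
```

Each record would then be embedded as a unit, with the `affects` list stored alongside the vector in your database.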

Now, is this still going to be interactable, and can the language in which it responds be modified?


By adhering to this policy you ensure that all information is true and accurate. Any omissions or errors on your part, or another party's, do not exclude you from these duties.

What I want it to respond with:

As long as you give accurate information then that’s fine.

Is this possible? I need it to speak customer service language, not legal language.

Show it an example (called a “shot”).

Show it a typical interaction, with roles and what would typically happen. Put that in a marked section with, say, ### tags around the start and finish, and then say: "Use the example in the ### markers and produce results which fit this interaction format. You should be friendly and personable; do not use hard-to-understand or legalistic terms."

Alright, I'll give it a go with a small part. But to be honest, I'm still trying to see if a fine-tune might be worth it in the end, as this model will actually scale into several other departments internally, with a slightly different role in each.

Understood; it all depends what you are expecting out of the fine-tune. If you intend to teach it how to answer questions and in what style, it could be very useful, but it won't add any "what" to that, i.e., it won't add facts or data, but it will add ways of doing.

If your problem statement is “the document sets out new rules, that are followed in the rest of the document, and these rules may be different for each document,” then current GPT models are not for you.

The current GPT models will predict outputs based on their training, given the pre-prompt. This doesn’t mean that they will follow all instructions. If a layman on the street who likes making stuff up about everything were to write a completion to your prompt, with the goal of convincing the reader that they knew the answer, the output would be of about the same quality as what you will get from these models.

Is it technically possible to embed a model with information and, let's say, 10,000 examples of how to respond? Like a shortcut instead of fine-tuning.

You could use embeddings for the data retrieval and then use prompt engineering to instruct the model how to respond. One effective way to do this is with a "shot," which is an example response in which you show the model how you want it to act; it will do its best to follow that template. Typically you put the example inside a pair of markers (### your_example ###) and then tell the model to respond like the example, but with data relevant to the new query.
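Putting retrieval and the shot together looks roughly like this. Toy vectors stand in for real embedding API calls, and all the names here are my own:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve(query_vec: list[float], records: list[dict], k: int = 2) -> list[dict]:
    """records: [{'text': ..., 'vec': ...}]; return the k most similar."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return ranked[:k]


def answer_prompt(query: str, passages: list[dict], example: str) -> str:
    """Stuff the retrieved passages and the example shot into one prompt."""
    context = "\n\n".join(p["text"] for p in passages)
    return (
        f"### {example} ###\n\n"
        "Answer using only the context below, in the style of the example.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

In practice you would replace the toy vectors with embeddings from the API and keep them in a vector database rather than a plain list.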

Is this with embeddings? I have a similar question here: I read about embeddings and I want to try it. How to start? I have a document I want the bot to use for questions. I tried it with fine-tuning, but the answers were very poor quality. Should I use embeddings? What is the "shot"? Do you mean to put the example in markers into the prompt, followed by the real questions?

I tried something similar with fine tuning: I read about embeddings and I want to try it. How to start? - #3 by Tetramatrix

{"prompt": "Subject:XXX\n\n###\n\nSummary:XXX.\n\n###\n\nCategory:XXX\n\n###\n\nCustomer:XXX.", "completion": "XXX"}

I put this \n\n###\n\n as the marker. Is that right? Simpler training data would be:
{"prompt": "XXX", "completion": "XXX"}, but I wanted to add as much data as possible, and there is a similar example on the OpenAI website: OpenAI Platform

But it didn't seem to help.

From what I have read, fine-tuning will resolve the style of question-and-answer format/structure you are after (the rules you mentioned), but it will NOT provide you with accurate answers, because fine-tuning is not necessarily great at information retention and is prone to hallucination. I concur with the other comment: retrieval augmentation (embeddings with a vector DB) with some overlap of text, and then, if that is not sufficient, I would investigate knowledge graphs as an additional component for maintaining the semantic "links" between related areas.
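For the one-directional references you described (the affected part never mentions the part that affects it), even a tiny inverted link map gets you most of the way before reaching for a full knowledge graph. A sketch with hypothetical part IDs:

```python
def expand_with_links(hit_ids: list[str],
                      affects: dict[str, list[str]],
                      max_extra: int = 3) -> list[str]:
    """affects maps a part ID to the IDs of parts it affects.
    Invert that map so a retrieval hit on an affected part also
    pulls in the parts that affect it, which it never cites itself."""
    affected_by: dict[str, list[str]] = {}
    for src, targets in affects.items():
        for t in targets:
            affected_by.setdefault(t, []).append(src)

    extra = []
    for h in hit_ids:
        for src in affected_by.get(h, []):
            if src not in hit_ids and src not in extra:
                extra.append(src)
    return hit_ids + extra[:max_extra]
```

You would run this after vector retrieval and before prompt assembly, so the context window carries both the matched part and everything that modifies it.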