data
I work for an org that has a huge corpus of reports (thousands). Frequently, we are expected to add a small amount of new material and contextualize it using past reports, for example by re-asserting things we previously said.
I would be happy to create a fine-tuned model to suit this purpose, but I'm struggling a bit to conceptualize the structure. It seems that GPT-3 uses a "one prompt, one response" approach. How, then, can I link a prompt to several reports and create some kind of amalgamated output?
The other thing I'm considering is semantic search… but that would only retrieve text that is nearby in embedding space, right?
How can I structure this?
lmccallum
Hi, there are several people in this forum focused on search and question answering over a well-defined corpus of materials. I think a common denominator is that you need to break your reports into chunks, each chunk not exceeding the token limits. You then need to put the chunks into a JSON Lines (JSONL) file and upload the file. For my use case, I've had to ensure that the breaks between the chunks are semantically meaningful (as opposed to simply programmatically splitting the corpus into n chunks of equal length). From your description, it sounds like you might have to do something similar, i.e., manually assess where the chunks should begin and end.

It's unclear from your description whether you'll only need to search your materials. If so, it's a fairly straightforward solution using the embeddings endpoint. If you need to get answers to questions about your materials, and/or generate text to modify your materials, the solution will be more complex. I'm doing search and Q&A, so I'm using the embeddings endpoint for the search part and davinci-002 for the Q&A part.

Somewhat counter-intuitively, I haven't found fine-tuning to be helpful for my use case, which involves a highly technical, specialized corpus. I have a theory about when fine-tuning is helpful and when it's not, but I won't bore you with that. Instead, I've found that prompt engineering is much more important and helpful for generating the outputs that I want. It's really quite amazing how different the results can be with seemingly very subtle changes in the prompt.

As a side point, it would be great if all the people in this forum who are focused on using GPT-3 with a defined corpus of their own materials could get together to brainstorm and share solutions. We could probably come up with some good feature requests for OpenAI. I hope the above is helpful.
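To make the chunking step concrete, here is a rough Python sketch. It assumes the reports are already available as plain text; `load_reports()` is a hypothetical loader, and the character cap is a crude stand-in for a real token count:

```python
# Rough sketch: greedily pack paragraphs into chunks, then write one
# JSON object per line (JSONL). Review the breaks by hand afterwards
# to make sure each one is semantically meaningful.
import json

MAX_CHARS = 6000  # crude proxy for a token limit; tune for your model

def chunk_report(text: str) -> list[str]:
    """Pack paragraphs into chunks that stay under MAX_CHARS."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > MAX_CHARS:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

with open("corpus_chunks.jsonl", "w") as f:
    for report_text in load_reports():  # hypothetical loader
        for chunk in chunk_report(report_text):
            f.write(json.dumps({"text": chunk}) + "\n")
```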
data
@lmccallum Thanks, that's really helpful. But now I'm a bit confused… how can I submit a corpus for default davinci to use? That sounds like it would solve many problems.
lmccallum
For Q&A or text generation, you have two choices: (1) fine-tune on your corpus, or (2) provide prompts that either (a) include sample outputs to demonstrate what you want or (b) are well-engineered to obtain what you want. It's probably a good idea to read OpenAI's documentation cover to cover - that really helped me, and it will save you a lot of false starts. I use approach (2)(b) for my Q&A, for the reasons noted in my earlier message.
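As a toy illustration of option (2)(a) - showing the model sample outputs before the real question - a prompt might be assembled like this (the excerpt and examples are placeholders, not from any real corpus):

```python
# Hypothetical few-shot prompt: two worked Q&A examples over an excerpt,
# then the real question left open for the model to complete.
prompt = """Answer questions using only the report excerpt below.

Excerpt: Our 2021 review found that Scope 2 emissions fell 12% year over year.

Q: Did Scope 2 emissions rise in 2021?
A: No. The 2021 review found they fell 12% year over year.

Q: What drove the decline?
A:"""
```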
Ro
I'm interested in this use case as well. Perhaps like the original poster, a Q&A function is not sensible for my case: most questions will be idiosyncratic one-offs, and I will not be able to anticipate what people need to know from the material. Rather, I only know that they need the material to perform their jobs. Given this, what are our best options for fine-tuning?
Thanks
PS - @lmccallum - I would be very interested to hear your theory on when fine-tuning is valuable and when it's not!
data
@lmccallum Thanks for your further clarification, and sorry for all the questions - I'm just trying to catch up! So your approach for Q&A is (2)(b), providing prompts that are well-engineered… Does that mean that you're choosing (manually, or using some crazy NLP skills) specific sentences from your corpus and adding them to each incoming question?
Sorry, I also don't know if I'm asking for your "secret sauce" and you'd prefer not to go into detail.
I guess as far as I can tell, there are two ways to "provide my corpus": (1) uploading a file and fine-tuning, or (2) using something like this, in which I'm not actually using the API to run queries - I'm just using the API to process my text and then running the model locally.
I did read the manual cover to cover, but it's highly probable that I missed something… is there another way to use your own data?
lmccallum
I'm doing something very similar to what's in that cookbook file. I'm using cosine similarity to find the best n matches to the embedded query from among the embeddings in my corpus (which have already been uploaded to OpenAI). Then I'm sending the best matches to the completion endpoint along with instructions. I'll be seeking approval to go live in the next week or so; once the functionality is on my website, I'll send you a link.
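For anyone following along, that flow might look roughly like the sketch below, using the older openai Python library (v0.x). The model names and prompt template are assumptions - substitute whatever you actually use - and the library reads the API key from the OPENAI_API_KEY environment variable:

```python
# Sketch: embed the query, rank corpus chunks by cosine similarity,
# then send the best matches to the completion endpoint with instructions.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, chunks: list[str], vecs: list[np.ndarray], n: int = 3) -> str:
    q = embed(query)
    # Rank all corpus chunks by similarity to the query embedding.
    best = sorted(range(len(chunks)), key=lambda i: cosine(q, vecs[i]), reverse=True)[:n]
    context = "\n\n".join(chunks[i] for i in best)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQ: {query}\nA:"
    )
    resp = openai.Completion.create(model="text-davinci-002", prompt=prompt, max_tokens=300)
    return resp["choices"][0]["text"].strip()
```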
lmccallum
Hi Ro. I originally thought that fine-tuning meant teaching GPT-3 about my own corpus. Since GPT-3 was trained on the internet, it is much better at completions involving topics that are voluminous on the internet. So, if we ask GPT-3 something like "What is the most populous country on earth?" we are much more likely to get a high-quality answer than if we ask something like "Do SEC rules require companies to disclose their Scope 3 GHG emissions?" I thought that if I provided GPT-3 with my corpus of legal rules (which are publicly available on the internet), it would be able to answer questions about those rules as well as it can answer questions about more common topics.

Unfortunately, I found that fine-tuning didn't really help GPT-3 understand my topic. It only helped with matching the writing style and tone of voice - the ability to speak in a convoluted way, like the SEC rules are written. But the actual quality and accuracy of the answers was no better than without fine-tuning.

To be fair, I didn't do very much experimentation with fine-tuning, so there may be some functionality I wasn't taking advantage of. I'd love to hear others' thoughts on that. But my initial experiment was poor enough that I decided it made more sense for my use case to focus on prompt engineering instead of fine-tuning. I hope that's helpful.
data
Thanks @lmccallum, your use case is really clever.
Look forward to your website.
Ro
Thanks @lmccallum for the deeper dive. Sadly, that is very similar to the kind of use case I envisioned, in which there is a complex corpus of information that is technical and professional. I was hoping as well that GPT-3 could ingest it and offer some informed guidance. It sounds like you uploaded the documents, but that the outputs were no more informative than before.
How did you go about uploading the information, given the required "prompt" and "completion" format needed for the JSONL files? Or, if this question is nonsensical, please help me understand better how the upload was done. I'm still wrapping my head around things.
And I'm interested to experience what you've built once it is live!
Thanks 
lmccallum
There are instructions in the documentation on uploading. There is a method where you can leave the prompt blank and provide all of your corpus as completions - just to show your body of text. That's what I did, as I can't create prompt-completion pairs. If you can create prompt-completion pairs that reflect what you want your system to achieve, then that training could certainly be helpful.
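A sketch of what that blank-prompt training file might look like, written from Python; `get_chunks()` is hypothetical, and the leading space on each completion follows the convention in the fine-tuning docs of the time:

```python
# Each line of the JSONL file is one record: an empty prompt and a chunk
# of the corpus as the completion.
import json

with open("train.jsonl", "w") as f:
    for chunk in get_chunks():  # hypothetical: yields corpus chunks
        f.write(json.dumps({"prompt": "", "completion": " " + chunk}) + "\n")
```

The job itself would then be started with something like `openai api fine_tunes.create -t train.jsonl -m davinci` (CLI syntax from that era's docs; check the current documentation before relying on it).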
mike3
Just throwing my 2 cents in here. I don't have a corpus of a thousand reports, just a few pages of text. With that said, I don't think you need to upload anything. Create a database to hold all of your chunks (paragraphs?) and their embeddings. Then generate embeddings for your queries and compare.
Then you feed those chunks back to GPT-3, perhaps for summarization first if you have a lot of them, then prompt with something like:
answer questions based on this text:
<insert your text here>
Q: why did the chicken cross the road?
Or something like that. But anyway, I guess my point is that if you are searching and answering questions, you are better off with your data stored locally.
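One way to sketch that local setup, with made-up table and column names and SQLite standing in for whatever database you prefer:

```python
# Store each chunk and its embedding locally in SQLite, then do a linear
# scan at query time. For a few pages of text, a linear scan is plenty.
import json
import sqlite3
import numpy as np

db = sqlite3.connect("corpus.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)"
)

def store(text: str, vec: list[float]) -> None:
    """Save a chunk and its embedding (serialized as JSON)."""
    db.execute("INSERT INTO chunks (text, embedding) VALUES (?, ?)",
               (text, json.dumps(vec)))
    db.commit()

def top_matches(query_vec: np.ndarray, n: int = 3) -> list[str]:
    """Return the n chunks whose embeddings best match the query."""
    rows = db.execute("SELECT text, embedding FROM chunks").fetchall()
    def score(row):
        v = np.array(json.loads(row[1]))
        return np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
    return [text for text, _ in sorted(rows, key=score, reverse=True)[:n]]
```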
Ro
Thanks @lmccallum. I am in the boat where there are no obvious prompt-completion pairs to be made - rather, there is a generalized need for content expertise across a very large corpus.
That said, it would be great to see what you have built when it is available!
Thanks