I work for an org that has a huge corpus of reports (thousands of them). Frequently, we are expected to add a small amount of new material and contextualize it using past reports, for example by re-asserting things we previously said.
I would be happy to create a fine-tuned model to suit this purpose, but I’m struggling a bit to conceptualize the structure. It seems that GPT-3 uses a “one prompt, one response” approach. How, then, do I link a prompt to several reports to create some kind of amalgamated output?
The other thing I’m considering is semantic search… but that would only retrieve text that sits close to the query in embedding space, right?
Hi, there are several people in this forum focussed on search and question answering based on a well-defined corpus of materials. I think a common denominator is that you need to break up your reports into chunks, each chunk not to exceed the token limits. You then need to put the chunks into a JSON Lines (JSONL) file and upload the file. For my use case, I’ve had to ensure that the breaks between the chunks are semantically meaningful (as opposed to simply programmatically splitting the corpus into n chunks of equal length). From your description, it sounds like you might have to do something similar, i.e., manually assess where the chunks should begin and end.

It’s unclear from your description whether you’ll only need to search your materials - if so, it’s a fairly straightforward solution using the embeddings endpoint. If you need to get answers to questions about your materials, and/or generate text to modify your materials, the solution will be more complex. I am doing search and Q&A, so I’m using the embeddings endpoint for the search part and davinci-002 for the Q&A part.

Somewhat counter-intuitively, I haven’t found fine-tuning to be helpful for my use case, which involves a highly technical, specialized corpus. I have a theory about when fine-tuning is helpful and when it’s not, but I won’t bore you with that. Instead, I’ve found that prompt engineering is much more important and helpful for generating the outputs that I want. It’s really quite amazing how different the results can be with seemingly very subtle changes in the prompt.

As a side point, it would be great if all the people in this forum who are focussed on using GPT-3 with a defined corpus of their own materials could get together and brainstorm and share solutions. We could probably come up with some good feature requests for OpenAI. I hope the above is helpful.
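If it helps to see the chunking idea in code, here’s a minimal sketch - it assumes paragraph breaks are good-enough semantic boundaries for a first pass (you’d still want to eyeball the results, as I said above), and the 500-token cap and the tiktoken encoding name are illustrative choices rather than recommendations:

```python
# Minimal chunking sketch: greedily pack paragraphs into chunks under a token cap.
# Assumptions: paragraphs are separated by blank lines; encoding name and cap are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative; pick the encoding for your model

def chunk_report(text: str, max_tokens: int = 500) -> list[str]:
    """Split a report on paragraph boundaries, keeping each chunk under max_tokens."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para).strip()
        if len(enc.encode(candidate)) > max_tokens and current:
            chunks.append(current)
            current = para  # a single oversized paragraph would still need manual splitting
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```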
@lmccallum Thanks, that’s really helpful. But now I’m a bit confused… how can I submit a corpus for default davinci to use? That sounds like it would solve many problems.
For Q&A or text generation, you have two choices: (1) fine-tune on your corpus, or (2) provide prompts that either (a) include sample outputs to demonstrate what you want or (b) are well-engineered to obtain what you want. It’s probably a good idea to read OpenAI’s documentation cover to cover - that really helped me, and it will save you a lot of false starts. I use approach (2)(b) for my Q&A, for the reasons noted in my earlier message.
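To make the difference between (2)(a) and (2)(b) concrete, here’s a rough illustration - the wording and the placeholder examples are mine, not a recommended template:

```python
# Rough illustration of the two prompting approaches; all text is placeholder content.

def few_shot_prompt(question: str) -> str:
    # (2)(a): include sample outputs so the model imitates the format and tone
    return (
        "Answer in the style of our past reports.\n\n"
        "Q: <example question from a past report>\n"
        "A: <example answer from a past report>\n\n"
        f"Q: {question}\nA:"
    )

def engineered_prompt(question: str, context: str) -> str:
    # (2)(b): no examples, just carefully worded instructions plus supporting context
    return (
        "Using only the report excerpts below, answer the question. "
        "If the excerpts do not contain the answer, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```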
I’m interested in this use case as well. Perhaps, like the original asker, I find that a Q&A function is not sensible for my case. The reason is that most questions will be idiosyncratic / one-off. Also, I will not be able to anticipate what people need to know from the material. Rather, I only know that they need the material to perform their jobs. Given this, what are our best options for fine-tuning?
Thanks
PS - @lmccallum - I would be very interested to know your theory on when fine-tuning is valuable and when it’s not!
@lmccallum Thanks for your further clarification, and sorry for all the questions - I’m just trying to catch up! So your approach for Q&A is (2)(b), meaning you are providing prompts that are well-engineered… Does that mean that you’re choosing (manually or using some crazy NLP skills) specific sentences from your corpus and adding them to each incoming question?
Sorry, I also don’t know whether I’m asking for your “secret sauce” and you’d prefer not to be too detailed.
I guess as far as I can tell, there are two ways to ‘provide my corpus’: (1) uploading a file and fine-tuning, or (2) using something like this, in which I’m not actually using the API to run queries - I’m just using the API to embed my text and then running the comparison locally.
I did read the manual cover-to-cover, but it’s highly probable that I missed something… is there another way to use your own data?
I’m doing something very similar to what’s in that cookbook file. I’m using cosine similarity to find the best n matches to the embedded query from among the embeddings of my corpus (that have already been generated through OpenAI). Then I’m sending the best matches to the completion endpoint along with instructions. I’ll be seeking approval to go live in the next week or so; once the functionality is on my website, I’ll send you a link.
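In case it helps, here’s a minimal sketch of that flow. It assumes the older openai Python package and uses illustrative model names (swap in whatever you’re actually using), and it assumes the corpus chunks and their embeddings are already computed and held in memory as numpy rows:

```python
# Sketch of embeddings-based search plus completion; model names are assumptions.
import numpy as np
import openai

# openai.api_key is assumed to be set elsewhere (e.g., via environment variable)

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])

def top_n_chunks(query: str, chunks: list[str], chunk_vecs: np.ndarray, n: int = 3) -> list[str]:
    q = embed(query)
    # cosine similarity = dot product divided by the product of the norms
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:n]]

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    context = "\n\n".join(top_n_chunks(query, chunks, chunk_vecs))
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQ: {query}\nA:"
    )
    resp = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=300, temperature=0
    )
    return resp["choices"][0]["text"].strip()
```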
Hi Ro. I originally thought that fine-tuning meant teaching GPT-3 about my own corpus. Since GPT-3 was trained on the internet, it is much better at completions involving topics that are voluminous on the internet. So, if we ask GPT-3 something like “What is the most populous country on earth?” we are much more likely to get a high-quality answer, compared to asking it something like “Do SEC rules require companies to disclose their Scope 3 GHG emissions?”

I thought that if I provided GPT-3 with my corpus of legal rules (which are publicly available on the internet), it would be able to answer questions about those rules as well as it can answer questions about more common topics. Unfortunately, I found that fine-tuning didn’t really help GPT-3 understand my topic. Instead, it only helped with matching the writing style, the tone of voice, the ability to speak in a convoluted way like the SEC rules are written. But the actual quality and accuracy of the answers was no better than without fine-tuning.

To be fair, I didn’t do very much experimentation with fine-tuning, so there may be some functionality that I wasn’t taking advantage of. I’d love to hear others’ thoughts on that. But my initial experiment was poor enough that I decided it made more sense for my use case to focus on prompt engineering instead of fine-tuning. I hope that’s helpful.
Thanks @lmccallum for the deeper dive. Sadly, that is very similar to the kind of use case I envisioned, in which there is a complex corpus of information that is technical and professional. I was hoping as well that GPT-3 could ingest it and offer some informed guidance. It sounds like you uploaded the documents, but that the outputs were no more informative than before.
How did you go about uploading the information, given the “prompt” and “completion” format required for the JSONL files? Or, if this question is nonsensical, please help me understand better how the upload was done. I’m still wrapping my head around things.
And, I’m interested to experience what you’ve built once it is live!
Thanks
There are instructions in the documentation on uploading. There is a method where you can leave the prompt blank and provide all of your corpus as completions - just to show the model your body of text. That’s what I did, as I can’t create prompt-completion pairs. If you can create prompt-completion pairs that reflect what you want your system to achieve, then that training could certainly be helpful.
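For what it’s worth, the JSONL file itself is just one JSON object per line with “prompt” and “completion” keys. Here’s a small sketch of both variants - the text is placeholder content, and the separator and leading space simply follow the conventions suggested in the fine-tuning docs:

```python
# Sketch of a fine-tuning JSONL file; all example text is placeholder content.
import json

rows = [
    # blank prompt: just showing the model your body of text
    {"prompt": "", "completion": " <a chunk of one of your reports>"},
    # or, if you can construct them, real prompt-completion pairs
    {"prompt": "Summarize our prior position on X ->",
     "completion": " <the summary you would want the model to produce>"},
]

with open("training_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```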
Just throwing my 2 cents in here. I don’t have a corpus of a thousand reports - I just have a few pages of text. With that said, I don’t think you need to upload anything. Create a database to hold all of your chunks (paragraphs?) and their embeddings. Then generate embeddings for your queries, and compare.
Then you feed those chunks back to GPT-3, perhaps for summarization first if you have a lot of them. Then prompt with something like:
answer questions based on this text:
<insert your text here>
Q: why did the chicken cross the road?
Or something like that. But anyway, I guess my point is that if you are searching and answering questions, you are better off with your data stored locally.
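To sketch what I mean by keeping the data local (file names and prompt wording are just placeholders, and this assumes you’ve already computed the embeddings once):

```python
# Sketch of a simple local store: chunks and vectors on disk, prompt rebuilt per query.
import json
import numpy as np

def save_corpus(chunks: list[str], vectors: np.ndarray) -> None:
    np.save("embeddings.npy", vectors)          # one row per chunk
    with open("chunks.json", "w") as f:
        json.dump(chunks, f)

def load_corpus() -> tuple[list[str], np.ndarray]:
    with open("chunks.json") as f:
        chunks = json.load(f)
    return chunks, np.load("embeddings.npy")

def build_prompt(matched_chunks: list[str], question: str) -> str:
    text = "\n\n".join(matched_chunks)
    return (
        "answer questions based on this text:\n\n"
        f"{text}\n\n"
        f"Q: {question}\n"
        "A:"
    )
```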
Thanks @lmccallum. I am in the boat where there are no obvious prompt-completion pairs to be made; rather, there is a generalized need for content expertise across a very large corpus.
That said, it would be great to see what you have built when it is available!
Thanks