I would like to load text from a book to fine-tune my ChatGPT to be more specific and helpful. That’s a lot of text to go through sentence by sentence to create prompts and completions. Is there a shortcut?
When you say you’d like to fine-tune your ChatGPT, do you mean that you’d like to train the model and ask questions related to the book to get more accurate responses? If so, have you considered exploring the use of embeddings instead?
Yes, you are correct. That’s exactly what I mean. Thank you for articulating it for me. To give a more specific use case: I would like to integrate ChatGPT into my chatbot, which of course focuses on a specific topic.
No, I have not. I’ll look into the use of embeddings. Could you elaborate on why embeddings may be better?
You should try to have something custom-made for yourself using the ChatGPT API.
ChatGPT is a quick and flexible language model. You don’t need to fine-tune it or use embeddings unless you have data that ChatGPT is not familiar with. Those who have more experience with language models know how difficult it used to be to get them to work, but with ChatGPT you often don’t even need a few examples to get it working. Prompting it is easy compared to other models.
As far as I know, you can’t fine-tune ChatGPT, but you can fine-tune base models like Curie and Davinci. If you need to find answers in your own dataset, then you should consider embeddings.
I suggest reading this document on OpenAI’s GitHub page: openai-cookbook/techniques_to_improve_reliability.md at main · openai/openai-cookbook · GitHub
I also have a small guide on prompting on my GitHub page that might be helpful.
I am also interested in training my model with texts. So if I’m not getting anything wrong, we still cannot fine-tune gpt-3.5-turbo, and one possible way to achieve the goal is using document embeddings, as in openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook · GitHub?
Can I share my two cents?
The model has been trained on practically everything that is freely available on the internet, and does a great job at one-shot answers — and an even better job with well-engineered prompts. Unless the information you want to use for fine-tuning is not already in the open, I don’t believe it’s worth the effort; it’s better to try to get what you want by making/engineering better prompts. For what it’s worth. Always open to any suggestions/comments/additions…
What is your final purpose? To ask open questions about the content of the book, or something else?
I fine-tuned davinci with a 70K-word book. I split the book into sentences, around 3,000 in total. When tested against arbitrary questions, all within the context of the book, I didn’t get the expected results: the model would diverge to content outside the book.
Then I went the ‘embeddings’ route with the same book. This time I got exactly what I wanted. I could ask the model any question about the book and it would answer perfectly most of the time (95%+, if I had to put a number on it).
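The embeddings route described above can be sketched roughly like this. This is my own minimal illustration, assuming the `openai` Python client (v1.x); the model names, chunking, and prompt wording are illustrative assumptions, not the exact code used:

```python
# Minimal sketch of embeddings-based QA over a book (illustrative only).
# Assumes the openai>=1.0 Python client and OPENAI_API_KEY in the environment.

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def embed_texts(texts, model="text-embedding-ada-002"):
    """Embed a list of text chunks (e.g. the book split into passages)."""
    from openai import OpenAI  # imported lazily; needs an API key to run
    client = OpenAI()
    resp = client.embeddings.create(model=model, input=texts)
    return [d.embedding for d in resp.data]

def answer_from_book(question, chunks, chunk_embeddings, top_k=3):
    """Retrieve the most similar chunks and answer only from them."""
    from openai import OpenAI
    client = OpenAI()
    [q_emb] = embed_texts([question])
    ranked = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(q_emb, pair[1]),
        reverse=True,
    )
    context = "\n---\n".join(chunk for chunk, _ in ranked[:top_k])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided book excerpts. "
                        "If the answer is not in them, say you don't know."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In practice you would embed all the chunks once up front with `embed_texts`, cache the vectors, and then call `answer_from_book` per question.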
Thank you for your response. I would like to leverage an OpenAI model to create a chatbot, similar to ChatGPT, that specializes in a specific domain within mental health and is more recent than ChatGPT’s current knowledge base, which is set in 2021.
Hey man, can you link the code for the “embeddings” method?
@juan_olano thanks man. I will take a look this weekend!
I’ve built this exact functionality using embeddings to search for the right passages from books (with HyDE) and then davinci to generate the output.
See my post on twitter for more info. Happy to help if you want to build something similar: https://twitter.com/naz_io/status/1647990346024988673
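The HyDE step mentioned above can be sketched as follows: instead of embedding the raw question, you first ask the model to write a passage that *would* answer it, then embed that passage and use it for retrieval. This is my own hedged sketch, assuming the `openai` Python client (v1.x); prompt wording and model names are illustrative:

```python
# Sketch of HyDE (Hypothetical Document Embeddings), illustrative only.
# Assumes the openai>=1.0 Python client and OPENAI_API_KEY in the environment.

def rank_by_similarity(query_embedding, passage_embeddings):
    """Return passage indices sorted by descending cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    scores = [cos(query_embedding, p) for p in passage_embeddings]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)

def hyde_query_embedding(question):
    """Generate a hypothetical answer passage and embed it.

    The passage need not be factually correct -- it only has to look like
    the kind of book passage we want to retrieve, which tends to place its
    embedding closer to relevant passages than the raw question's embedding.
    """
    from openai import OpenAI  # imported lazily; needs an API key to run
    client = OpenAI()
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short passage that would answer: {question}"}],
    ).choices[0].message.content
    emb = client.embeddings.create(model="text-embedding-ada-002",
                                   input=[hypothetical])
    return emb.data[0].embedding
```

You would then feed the result of `hyde_query_embedding` into `rank_by_similarity` against your pre-computed passage embeddings, and pass the top passages to davinci (or a chat model) to generate the final answer.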
Hello, I’m a medical oncologist in the process of training the model on a very specific topic (EGFR-mutated non-small-cell lung cancer), but I need to include pivotal articles from 2022 and Q1 2023. Any recommendations?
If you’re looking for a place to chat with documents, we’ve built Sharly — we tested it on the Harry Potter books and it works pretty well.
On your request about prompts: are you looking to create and save your own, or something else?
If I already have a large labeled QA dataset, how can I leverage it to further enhance performance on top of question answering with embeddings-based search?