I am trying to create a kind of Support Bot to answer my clients' questions about specific technical details of the WordPress plugins that I sell.
The goal is that the /completions API would be fed a prompt. It could be something general, like a CSS styling change, which the davinci engine knows without any specific data about my business. But the customer might also ask something specific, for which I have a data set of about 3000 questions and answers (prompts / completions? input / output?) that the bot can draw on, exactly like this awesome example here
I am a web developer and I don’t have experience with AI. I am just scratching the surface, trying to put this bot together while learning concepts like Machine Learning, Training Data, Validation Set, Plotting, Neural Network. So bear with me, because it’s a lot to grasp.
So first of all, I did a lot of reading, and getting an API key from OpenAI is certainly the first step.
Then I told ChatGPT my story and what I was trying to achieve. I asked it to write the code, preferably in PHP, but it always ended up hallucinating, so I could not really use anything it generated without adjusting it. And the more I asked about specifics, the more it hallucinated.
So I read a lot of documentation and, extrapolating from what I got from ChatGPT, I think there are 3 ways to achieve this:
- a fine-tuned model
- uploading a training set and a validation set
- the embeddings API (which the example that I linked uses)
Since most examples are in Python, I started with the mlq.ai/gpt-3-fine-tuning-key-concepts/ tutorial. Then I prepared my data with
fine_tunes.prepare_data (DATA_UNDER_COMMENT), which turned it into JSONL (one JSON object per line), with each line split into a prompt and a completion.
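To make the format concrete, here is a minimal sketch of how question/answer pairs end up as JSONL. The `prompt` and `completion` field names are the ones the fine-tuning format uses; the helper name, the sample rows, and the leading space on each completion are my own assumptions for illustration, not my real data.

```python
import json

def qa_pairs_to_jsonl(pairs):
    """Turn (question, answer) tuples into JSONL text: one JSON object per line."""
    lines = []
    for question, answer in pairs:
        record = {
            "prompt": question.strip(),
            # Assumption: completions start with a single space.
            "completion": " " + answer.strip(),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"

# Hypothetical sample rows, made up for this example.
sample_pairs = [
    ("How do I change the button color?", "Add a CSS rule for the button class."),
    ("Where do I enter the license key?", "Open the plugin settings page and paste it."),
]

jsonl_text = qa_pairs_to_jsonl(sample_pairs)
print(jsonl_text)
```

Each printed line is a standalone JSON object, which is what "JSON by line" means here.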
Then, I used
openai api fine_tunes.create -t to create my fine-tune
Now that my fine-tune is created, I run
openai.Completion.create(model=FINE_TUNED_MODEL, prompt=YOUR_PROMPT)
This looked like the way to go, but even if I ask a basic question that is actually in the JSONL, it’s like the engine forgot how to talk and it outputs random characters.
So I tried another approach from the cookbook, which seems pretty great. I followed this, which seems to be exactly what I want to achieve:
"The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information."
I tried to use the code there, but with my CSV hosted online I got a 406 response when trying to load it.
Then I stored the CSV locally, and it complained that a column (tokens) was not available for conversion to int (10).
Then, from what I could understand, I switched from load_embeddings to compute_doc_embeddings, because the documentation says that load_embeddings is used there only because they already have the embeddings generated for that CSV.
I did that, but now it asks for a JSON instead of a CSV.
Of course, I can provide my data in any format, but when I tried to load it, I was told that the token limit of 8000 was exceeded for this request.
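For the full data set I assume I would have to split the input into several smaller requests. Below is a rough sketch of that idea. The 4-characters-per-token estimate is only a heuristic I am assuming (a real tokenizer would be more accurate), the helper names are made up, and I am taking the error message at its word that the limit applies to the request as a whole.

```python
def estimate_tokens(text):
    """Rough estimate: about 4 characters per token (a heuristic, not exact)."""
    return max(1, len(text) // 4)

def batch_by_token_budget(texts, max_tokens=8000):
    """Group texts into batches whose estimated token total stays within max_tokens."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        if cost > max_tokens:
            raise ValueError("a single text exceeds the budget; split it first")
        # Start a new batch when adding this text would go over the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches

# Hypothetical data: 10 texts of 4,000 characters (about 1,000 estimated tokens each).
texts = ["x" * 4000 for _ in range(10)]
batches = batch_by_token_budget(texts, max_tokens=3000)
print([len(batch) for batch in batches])
```

Each batch could then be sent as its own request instead of sending everything at once.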
I then tried to input a small JSON instead (here, under a comment)
and ran a prompt. And, kind of amazingly, after hours of work, it seems to work. I ask a question from the data, but phrased differently, and it answers correctly, using different wording than the one in the JSON data.
It could not have known this from general knowledge.
So this is what I want to achieve, but my data set is much larger.
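As far as I understand the cookbook approach, it boils down to: compare the embedding of the question with the embeddings of my documents, pick the closest ones, and paste them into the prompt as context. Here is a minimal sketch of that logic, assuming the embeddings are already computed. The toy 3-dimensional vectors, the helper names, and the prompt wording are all made up for illustration; real vectors would come from the embeddings API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_matches(query_vector, entries, k=2):
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(
        entries,
        key=lambda e: cosine_similarity(query_vector, e["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, matches):
    """Put the retrieved texts in front of the question as context."""
    context = "\n".join("- " + m["text"] for m in matches)
    return (
        "Answer the question using the context below.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question + "\nAnswer:"
    )

# Toy vectors standing in for real embeddings (hypothetical data).
entries = [
    {"text": "License keys are entered on the plugin settings page.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Button colors can be changed with a CSS rule.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Refunds are handled through the support form.", "embedding": [0.0, 0.1, 0.9]},
]
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the question

matches = top_matches(query_vector, entries, k=1)
prompt = build_prompt("Where do I put my license key?", matches)
print(prompt)
```

The resulting prompt is what would be sent to the completions endpoint, so only the few relevant entries count against the token limit rather than the whole data set.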
I need help to understand whether my approach is correct. And if embeddings are the way to go, how do I feed my data into OpenAI and reference that set of embeddings when making API calls to completions? Ideally I would have those embeddings stored somewhere, with the possibility to add to them, just like I have fine-tune sets or files under my API account.
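To make that last point concrete, what I picture is something like the sketch below: the vectors kept in a file on my side that I can append to whenever I have a new question/answer pair. The file layout and helper names are my own invention, and the vectors here are toy values; whether something equivalent can live under my API account is exactly what I am asking.

```python
import json
import os
import tempfile

def load_store(path):
    """Read all stored entries, or return an empty list if the file does not exist yet."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def add_entry(path, text, embedding):
    """Append one text and its embedding to the store and return the new entry count."""
    entries = load_store(path)
    entries.append({"text": text, "embedding": list(embedding)})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    return len(entries)

# Hypothetical store location and toy vectors, for illustration only.
store_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
add_entry(store_path, "License keys are entered on the settings page.", [0.9, 0.1, 0.0])
count = add_entry(store_path, "Button colors can be changed with CSS.", [0.1, 0.9, 0.0])
print(count, "entries stored")
```

With this layout, adding a new answer only requires computing one new embedding, not redoing the whole set.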