How to feed data for completions instead of using the prompt/answer fine-tuning format?

I’d like to feed the model data that it can ingest and then use for responses.

Stuff like articles, documentation, etc., where there is no prompt/optimal-answer format.

Is this possible?
Thanks!

2 Likes

Yes, the available method is called “fine-tuning”, and the OpenAI API docs cover this.

HTH

1 Like

Yes, I saw that:

create-training_file

fine-tuning

But it uses the prompt/optimal-answer format:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

So the question is: how do I say, “take this article and just add it to your knowledge base”?

There is no question or optimal answer for the article.

2 Likes

You must format a file in JSONL with the data you wish to train via fine-tuning.

Yes, that is the current way to do it. :slight_smile:

You cannot just “take articles” and “add them to the OpenAI network”. The only feature currently available is the fine-tuning method you don’t like :slight_smile:

4 Likes

OK, so I have 3,000 characters that represent the article, and zero characters that represent an optimal reply

… where do I put what?

ChatGPT has general knowledge: I can ask it questions about a film, but I cannot ask it questions about internal company documents.

I just want to feed it generic data, like an article or a Wikipedia page.

So what would that look like in the approach you describe?

3 Likes

Same story for ChatGPT: imagine it has a knowledge cutoff in 2021, so I can ask it questions about some film from 2010 because it was fed the Wikipedia and other pages talking about that film.

But now I want to ask it for a summary of a film from 2023… well, there has to be a way to feed it that new data from 2023.

And IMO this is definitely not tuning in the sense of

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
2 Likes

I haven’t done it, so take this with a grain of salt, but:

If you have three articles (Safety Regulations, HR Policy, Mission Statement), you send each to the embeddings service and receive back a mathematical characterization (a vector) for each.
Safety Regulations [0.01024, 0.0551, …]
HR Policy [0.028590, 0.89784, 0.9845, …]
Mission Statement [0.847, …]

Then, you take your query and get the embedding of THAT.
“are t-shirts against company policy?” [0.98945, 0.402, …]

You do some math between the query’s embedding, and each article’s embedding, yielding three similarity scores.
Query-Safety regulations = 0.5
Query-HR Policy=0.7
Query-Mission statement=0.65
Doing the math is not hard; just ask ChatGPT how to compute the cosine similarity.

Take the highest/most similar score, and paste that article’s text into the completion service along with the prompt text:
“Info-A productive workplace is important. Professionalism is expected at all times. Proper attire should be worn at all times. Question-Are t-shirts allowed? Answer-”
and, it’ll probably give you some kind of generated response telling you no, t-shirts are not allowed.

In reality, whole articles are probably too coarse-grained and too long, so you might use a recursive process: find the most similar articles, then the most similar paragraphs within those, and maybe the most similar sentences, so that when you pass the document text you’re not also passing tokens about wet floors in the workplace. But also, maybe the mission statement has something about a relaxed atmosphere, so I wouldn’t pull from just one article.
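
To make that concrete, here is a minimal sketch in Python. The `embed` helper is a hypothetical stand-in for whatever embeddings service you call (it fakes deterministic vectors here so the sketch runs); the rest is just the cosine-similarity ranking described above.

```python
import numpy as np

def embed(text):
    # Hypothetical stand-in: a real implementation would call an
    # embeddings service; here we fake a deterministic vector per text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

articles = {
    "Safety Regulations": "Wet floors must be signposted at all times...",
    "HR Policy": "Professionalism is expected. Proper attire must be worn...",
    "Mission Statement": "We value a productive, relaxed workplace...",
}

# 1. Embed each article once, up front.
article_vectors = {title: embed(text) for title, text in articles.items()}

# 2. Embed the query and score it against every article.
query = "Are t-shirts against company policy?"
query_vector = embed(query)
scores = {title: cosine_similarity(query_vector, vec)
          for title, vec in article_vectors.items()}

# 3. Paste the most similar article's text into the prompt.
best = max(scores, key=scores.get)
prompt = f"Info: {articles[best]}\nQuestion: {query}\nAnswer:"
print(prompt)
```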

7 Likes

Finally some insight, thanks :+1:

Basically, if I understand correctly, we currently need a database on our side (sketched in code below):

  1. Generate embeddings from PDFs and whatnot, and save them to a local database
  2. Generate an embedding from the question, run a query against the local database, and return the original content
  3. Use the original content to pre-feed the prompt, and add the original question
  4. Get the result, and pay mucho tokens ^^
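
In code, those four steps might look roughly like this. This is a sketch assuming the `openai` Python package’s embeddings and completions endpoints with the `text-embedding-ada-002` and `text-davinci-003` models; the in-memory list stands in for a real local database.

```python
import numpy as np
import openai

openai.api_key = "sk-..."  # your API key

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# 1. Generate embeddings from your content and save them (here: in memory).
chunks = ["Proper attire should be worn at all times. ...",
          "Visitors must sign in at reception. ..."]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the question and query the store for the closest content.
question = "Are t-shirts allowed?"
q = embed(question)
best, _ = max(store, key=lambda cv: float(np.dot(q, cv[1]) /
              (np.linalg.norm(q) * np.linalg.norm(cv[1]))))

# 3. Pre-feed the retrieved content, then add the original question.
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Context: {best}\nQuestion: {question}\nAnswer:",
    max_tokens=200,
)

# 4. Get the result (and pay mucho tokens).
print(completion["choices"][0]["text"])
```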

I hope OpenAI integrates something like this on their server backend. We need a way to simply add our content to be indexed by GPT.

Once companies can simply have a marketing or documentation team feed the GPT database, and have another team build a frontend that is used internally as an AI assistant, this will go to the moon :slight_smile:

Imagine simply feeding it your whole website archive and then having a search assistant on your site. This would rock on sites that rely heavily on reading materials: QA, marketing, education, …


Here is a simple use case for the company I work for.

They do a lot of collaboration with Airbus Industries, much of it to do with safety regulations, electrical standards, assembly/supply-chain management, and training courses/consultancy in these matters.

So we have domain experts & instructors (Airbus uses ServiceNow as an internal tool).

Now if I could just feed GPT all kinds of technical docs, standards, etc., that would make for a great interactive Wikipedia.

4 Likes

There are vector search databases which support embeddings; comparing a query against a large number of embeddings efficiently is not something you can easily do yourself.

A few of these databases have OpenAI integrated into their configuration, to make it simple for you.

I prefer weaviate.
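
For example, here is a minimal sketch using the weaviate Python client (v3 interface) with its text2vec-openai module, as I understand it; check the current docs, since names and options may differ:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# A class whose objects weaviate vectorizes with OpenAI embeddings.
client.schema.create_class({
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}],
})

client.data_object.create({"content": "Proper attire is required..."}, "Article")

# Semantic search: weaviate embeds the query text and compares vectors.
result = (client.query
          .get("Article", ["content"])
          .with_near_text({"concepts": ["t-shirt policy"]})
          .with_limit(1)
          .do())
print(result)
```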

What you said about having an AI assistant built from your own knowledge base is possible, but it is limited.
If you try ChatGPT, it can combine unrelated pieces of knowledge, but it won’t do the same with your knowledge base.

2 Likes

Looks pretty cool, but still too technical given the small amount of time I have.

My company, like Airbus, is an old dinosaur: they barely have an IT team, and delegate what’s left to things like Box, Google Apps, ServiceNow, etc.

However, if you can meet their needs, they (Airbus) have plenty of cash to flip around :rofl:

1 Like

The easiest way I have found so far is to use a Python library called GPT Index.
You use OpenAI through it, and it will automatically generate embeddings from a repository of text documents. They also recently added PDF support, and they support Google Drive and Slack, if I remember correctly.
Once it has its embeddings, you can create an index and send queries to that index for question-answering use cases.
The whole thing can be done in only a few lines of code.
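
For example, something like this; a sketch from memory of the gpt_index examples, so class names may have changed since:

```python
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every document in the folder (plain text, and recently PDFs too).
documents = SimpleDirectoryReader("docs/").load_data()

# Build an index; embeddings are generated via the OpenAI API automatically.
index = GPTSimpleVectorIndex(documents)

# Question answering against your own documents.
response = index.query("What does our HR policy say about attire?")
print(response)
```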

6 Likes

Will definitely have to have a peek at that one, thanks :+1:

1 Like

I can see ChatGPT being integrated with Office 365 and enabled just like that.
I can see a feature within Microsoft Word where the user chats with a bot and receives responses based on files previously uploaded.

1 Like

Airbus just switched all their stuff from Microsoft to Google Apps… their loss :stuck_out_tongue:

The Google stuff is actually pretty good, but they are now locked into Apps Script for basic manipulation, and that ecosystem sucks as of today compared to VS Code & co…

1 Like

So I’m jumping into this thread to ask another related question.

Let’s say my data is not a document; it is built from different objects. For example, I have 1,000 objects of type A, and each A object may have the properties x, y, z.

I would like to generate from this type of data a JSONL dataset to fine-tune my own model, and then use the OpenAI API to ask questions about this data. For instance, I would like to ask something like: show me all A objects where x > 3.

How should I create the dataset?
And is it even the right way to do it?

1 Like

Strictly on your example, I would proceed differently.
I would make object A semantic first and use embeddings instead of fine-tuning.
Basically, the object would be described in natural language.

When the user prompts, the closest embeddings would be identified.
Then GPT would execute your request on the selected embeddings.

In practice this solution may not be feasible, but hopefully it’s a start.
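
Roughly, something like this; a sketch where `describe` and the x, y, z fields follow the example above, and the embedding call is left as a comment:

```python
def describe(obj):
    # Make object A "semantic": a plain-language description of its fields.
    return (f"Object {obj['id']} of type A has "
            f"x = {obj['x']}, y = {obj['y']}, z = {obj['z']}.")

objects = [{"id": 1, "x": 5, "y": 0, "z": 2},
           {"id": 2, "x": 1, "y": 7, "z": 3}]

descriptions = [describe(o) for o in objects]
# vectors = [embed(d) for d in descriptions]  # embed() = your embeddings call

# At query time: embed the user's question, select the closest descriptions,
# and pass them to GPT as context for the actual request.
```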

1 Like

I tried this approach.

It’s not good enough: to answer, you need to provide context, and if I have a lot of objects related to the query (for instance, a “how many…” question), retrieval won’t surface them all.

1 Like

I agree with the points above about using embeddings to narrow down relevant context as input, instead of fine-tuning.

I like GPT Index, having discovered it recently, but for my own purposes I need it to understand different parts of documents accurately, and in particular to work well on varied PDFs and videos, so I run a commercial service (powered by OpenAI) to automate this task: www.fragen.co.uk. I began work on this in August, as I’ve mentioned on Reddit.

Key advantages compared to rolling your own GPT Index include:

  • Better semantic text due to proper OCR + extra processing = better embeddings = better results
  • Deliberate source-document re-ranking to provide a better GPT response
  • Ranks the sentences in an answer by confidence in how grounded they are (green to red), and shows all source document pages/quotes with answers. This uses tech from the last two years at Revision.ai (it reduces untruthful statements by about 3x)
3 Likes

The other bonus with embeddings that hasn’t been mentioned is that you can update your facts (embeddings) instantly, without updating a fine-tune. So as your facts roll in or get refined over time, it is nearly instantaneous and relatively painless to update them (and zero cost).

I agree with @i-technology that as soon as OpenAI can get this automated for the broader audience, it will go viral. Just like prompting GPT-3 with the history of the conversation == ChatGPT == viral. In the meantime, those with time and computing on their side will revel alone in this secret.

3 Likes

Hello, I’m finding this thread very useful. Which approach worked out for you in the end? Have there been any innovations since?