How to structure the embeddings?

For fine-tuning, the training format is clear: question-answer, question-answer.

But how do you best structure embeddings? Should an embedding also be a question-answer pair? Should it be like a how-to article?

My goal is to put my opinions on a certain topic into a model, to provide customers an automated philosophy discussion “with me”.


A great explanation of vector embeddings: Brainstorming ChatGPT Business Ideas With A Billionaire (#438) - YouTube

Text embeddings and semantic search: Text embeddings & semantic search - YouTube


There are many ways to do it. But to be optimal, you want the embedding of the input to lock on to the most relevant content in your vector database.

So if you expect someone to ask “What color is the sky?”, you should have embedded a similar question in your vector database. Then there is the signal-to-noise ratio (SNR) question. On your end, you could embed only the question, “What color is the sky?”, or you could embed the question-and-answer pair “Q: What color is the sky? A: Blue”. The latter embedding, with the answer “Blue” included, will have a diluted correlation compared to embedding the question alone. This goes for anything where the answer is de-correlated from the question.

So if you embed the question only, and correlate question-to-question, you’ve increased the SNR! But your backend database will have to write out both the question and the answer to feed to your prompt (so more complexity), vs. the lower-complexity Q/A composite embedding (but lower SNR).
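To make that concrete, here’s a minimal sketch of the high-SNR variant: embed the question only, and keep the answer in a separate field for the prompt. It assumes the OpenAI Python client and a toy in-memory store; the model name and helper names are just illustrative.

```python
# Minimal sketch of the "embed questions only" approach. Assumes the
# OpenAI Python client; the model name and in-memory store are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Backend store: embed the question only (high SNR), keep the answer in a
# separate field so it can be written into the prompt later.
qa_store = [
    {"question": "What color is the sky?", "answer": "Blue"},
]
for row in qa_store:
    row["vector"] = embed(row["question"])  # question-only embedding

def retrieve(user_input: str) -> dict:
    v = embed(user_input)
    # Cosine similarity between the input and each stored question.
    sims = [
        np.dot(v, r["vector"]) / (np.linalg.norm(v) * np.linalg.norm(r["vector"]))
        for r in qa_store
    ]
    return qa_store[int(np.argmax(sims))]

best = retrieve("What color sky!")
print(best["question"], "->", best["answer"])  # feed both into the prompt
```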

But you also mention “philosophy”. Most philosophical writings aren’t question/answer pairs. They are writings organized around different topics and thoughts. In this case, you should just embed chunks of this data, hope or expect the question correlates with your writings, and have GPT draw from those correlated chunks for the answer.

So, in a nutshell: to get a high-SNR correlation on questions, embed just the questions and use the database to form the Q/A pairs for the prompt to draw from (high SNR, but more complexity), OR embed composite Q/A pairs for lower complexity and lower SNR.

And if you have a big corpus of text that isn’t structured this way (as Q/A pairs) then embed this data directly in chunks, and let GPT draw directly from this data for the answer.
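A rough sketch of that chunk-and-embed approach; the chunk size, overlap, and filename are arbitrary placeholder choices, not recommendations.

```python
# Sketch of embedding an unstructured corpus in chunks. Assumes the OpenAI
# Python client; chunk size, overlap, and the filename are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Naive fixed-size character chunks, overlapped so a thought that spans
    # a boundary still lands intact in at least one chunk.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

corpus = open("my_philosophy_writings.txt").read()  # hypothetical file
chunks = chunk(corpus)
vectors = [embed(c) for c in chunks]  # correlate the question against these
```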

There are obviously tons of permutations as well. For example, have an internal “AI Agent” ruminate over the question to derive “the real question” (reframing) and then draw on your embeddings for the answer. This could add some entropy to the system and give fun and unexpected results, especially in a philosophical debate bot!
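A tiny sketch of what that reframing pre-step could look like, assuming a chat-completion call; the prompt wording and model name are just placeholders.

```python
# Sketch of an "AI Agent" reframing pre-step; prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()

def reframe(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Restate the user's question as the underlying "
                        "question they are really asking. Reply with the "
                        "reframed question only."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Embed reframe(user_input) instead of (or in addition to) the raw input
# before correlating against your vector database.
```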


A strategy I used to achieve this was to submit each chunk of my text in a completion request that simply prompts: “Please generate (x) questions that this text answers.” Then I include the generated questions in the chunks to be embedded. It’s an extra step, and consumes more tokens, but it’s an automated way to achieve a higher SNR.
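Something like this, perhaps; the prompt is the one quoted above, while the model name and question count are placeholder choices.

```python
# Sketch of the "generate questions per chunk" step. Assumes the OpenAI
# Python client; the model name and question count are placeholders.
from openai import OpenAI

client = OpenAI()

def questions_for(chunk: str, n: int = 3) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Please generate {n} questions that this text "
                       f"answers:\n\n{chunk}",
        }],
    )
    return resp.choices[0].message.content

chunk_text = "..."  # one chunk of your source text
# Prepend the generated questions so the chunk's embedding correlates with
# the kinds of questions users will actually ask.
to_embed = questions_for(chunk_text) + "\n\n" + chunk_text
```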


That’s a great idea! Basically cast a wider net on the input before correlation.


I’m just wondering, given the example above:

Would it be possible to reduce the SNR by embedding the answers only? In this case, something like “The sky is blue.”, to increase the correlation between “sky” and “blue”?


Yes, that could work, maybe at a reduced SNR. This is equivalent to my “big corpus of unstructured text” case above.

Which, I’m guessing, will reduce the SNR somewhat, but it’s totally viable!


Interesting. What I’m thinking is that you can use GPT to increase the SNR and correlation before embedding. If I use the prompt “Generate very semantically compressed factual statements based on the following:” on the history section of the machine learning Wikipedia article, I get the following response:

Arthur Samuel coined “machine learning” in 1959, and by the 1960s, Raytheon Company developed CyberTron, an experimental learning machine. Interest in pattern recognition grew in the 1970s and 1980s. Tom M. Mitchell’s definition of machine learning emphasizes learning from experience and improving performance, aligning with Turing’s question, “Can machines do what we can do?”…

Compared to the original wiki text, it seems to capture “the important stuff”.
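For reference, a sketch of that compression step, assuming a chat-completion call; only the prompt text comes from the post above, and everything else is a placeholder.

```python
# Sketch of semantically compressing a passage before embedding it.
# The prompt is quoted from the post above; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def compress(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Generate very semantically compressed factual "
                       "statements based on the following:\n\n" + text,
        }],
    )
    return resp.choices[0].message.content

# Embed compress(section_text) instead of the raw section.
```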


You could also do all the above.

Example:
(1) Embed the original incoming text, (2) embed the GPT expansion. Correlate both and gather a list of top hits for each. So List1 = correlation on the original, List2 = correlation on the GPT expansion. Merge these two lists together, and take the embeddings based on the top correlations in the merged list.

This way you are giving the original input a shot at achieving the highest correlation, and using the GPT expansion as a backup.
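A possible shape for that merge, assuming you’ve already embedded the documents, the original input, and the GPT expansion (e.g. with a helper like the embed() sketch earlier in the thread). Taking the max score per document is one reasonable merge policy, not the only one.

```python
# Sketch of the two-list merge. Assumes document vectors and the two query
# vectors (original input, GPT expansion) are already computed.
import numpy as np

def top_hits(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> dict[int, float]:
    # Cosine similarity of the query against every document vector.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return {int(i): float(sims[i]) for i in np.argsort(-sims)[:k]}

def merge_lists(list1: dict[int, float], list2: dict[int, float], k: int = 5) -> list[int]:
    merged: dict[int, float] = {}
    for hits in (list1, list2):
        for idx, sim in hits.items():
            merged[idx] = max(merged.get(idx, 0.0), sim)  # keep the best score
    # Final ranking over the union of both lists.
    return sorted(merged, key=merged.get, reverse=True)[:k]

# List1 = correlation on the original, List2 = on the GPT expansion:
# list1 = top_hits(v_original, doc_vecs)
# list2 = top_hits(v_expanded, doc_vecs)
# final = merge_lists(list1, list2)
```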


That seems like a really good idea :laughing:


To me, this looks like a bad idea:

If a user’s question correlates with a question in our embeddings, then the AI would ignore the answer part and base its answer only on the base model’s knowledge, ignoring that we have an answer.

Isn’t that so?

Remember, the embeddings all correlate and map back to YOUR DATA!

So all this is trying to do is smooth out the interface between <Random Question> and <Company Approved Answer>.

A big problem is this lack of correlation between the question (or input) and your data. And you don’t want a “weak question” to correlate with off-topic data either. You can also tell GPT to respond with “I don’t know” if the answer doesn’t lie within the data that was pulled from the embedding correlations.

Not if you feed the answer back into the prompt!

Example input: “What color sky!”

Your database:
“What color is the sky?”
“Is the earth flat?”
“Where is Waldo?”

The correlation locks onto the first one. So you look this up, along with the answer in your database (stored in another field), and put both in the prompt:

Use the following context to answer the question:

Q: What color is the sky?
A: Blue

Q: What color sky!
A:

########

The Q/A pair was from your embedded data, and the original question is fed back in and GPT will reference your data to answer it!
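Put together, the lookup-then-prompt step might look like this sketch; the “I don’t know” guard from the earlier post is included, and the model name and wording are illustrative.

```python
# Sketch of feeding the retrieved Q/A pair back into the prompt. Assumes the
# OpenAI Python client and a retrieval step like the sketch earlier in the
# thread; the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def answer(user_input: str, best_q: str, best_a: str) -> str:
    prompt = (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        f"Q: {best_q}\nA: {best_a}\n\n"
        f"Q: {user_input}\nA:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# e.g. answer("What color sky!", "What color is the sky?", "Blue") -> "Blue"
```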


I have the same problem!
I want it to deploy some human-like intelligence, as if it had been taught and had gained some basic knowledge.
But I don’t know how to classify that knowledge, since I don’t know how my own knowledge is organized in real life either.
I want it to do some basic analysis on its own, without needing instructions every time. I’m exhausted by this, even though an approach like letting it read some books and then testing it with real scenarios seems like a good idea. It is actually hard.
Does anyone have any ideas? Or papers on this? Or thoughts? Resources? Any resource would be appreciated!

Embeddings maybe?

embeddings = knowledge
