How to structure the embeddings?

Training data for fine-tuning is clear: question-answer, question-answer.

But what is the best way to structure embeddings? Should an embedded chunk also be a question-answer pair? Should it read like a how-to article?

My goal is to put my opinions on a certain topic into a model, so that customers get an automated philosophy discussion “with me”.


A great explanation of vector embeddings: Brainstorming ChatGPT Business Ideas With A Billionaire (#438) - YouTube

Text embeddings and semantic search: Text embeddings & semantic search - YouTube


There are many ways to do it. But to be optimal, you want the embedding of the input to lock onto the most relevant content in your vector database.

So if you expect someone to ask “What color is the sky?”, you should have embedded a similar question in your vector database. Then there is the signal-to-noise ratio (SNR) question. You could embed only the question on your end: “What color is the sky?”, or you could embed the question-answer pair “Q: What color is the sky? A: Blue”. The latter embedding, with the answer “Blue”, will have a diluted correlation compared to the original question. This goes for anything where the answer is de-correlated from the question.

So if you embed the question only, and correlate question-to-question, you’ve increased the SNR! But your backend database will have to store both the question and the answer to feed to your prompt (so more complexity), vs. the lower-complexity Q/A composite embedding (but lower SNR).
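To see the dilution effect, here is a toy bag-of-words “embedding” — a stand-in for a real embedding model, with all names illustrative rather than from any library:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding -- a stand-in for a real embedding model."""
    return Counter(text.lower().replace("?", "").replace(":", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

query = embed("What color is the sky?")
question_only = embed("What color is the sky?")
qa_composite = embed("Q: What color is the sky? A: Blue")

# The question-only embedding matches the incoming question more strongly
# than the composite, whose answer tokens dilute the correlation.
print(cosine(query, question_only) > cosine(query, qa_composite))  # True
```

Real embedding models behave more subtly than word counts, but the direction of the effect is the same: extra answer tokens pull the composite vector away from the question-shaped query.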

But you also mention “philosophy”. Most philosophical writings aren’t question/answer pairs. They are writings organized around different topics and thoughts. In this case, you should just embed chunks of this data, hope or expect that the question correlates with your writings, and have GPT draw from those correlated writings for the answer.

So, in a nutshell, to have high SNR correlation on questions, just embed the questions, and use the database to form the Q/A pairs for the prompt to draw from (high SNR, but more complexity), OR composite Q/A pairs for lower complexity and lower SNR.

And if you have a big corpus of text that isn’t structured this way (as Q/A pairs) then embed this data directly in chunks, and let GPT draw directly from this data for the answer.

There are obviously tons of permutations as well. For example, have an internal “AI Agent” ruminate over the question to derive “the real question” (reframing) and then draw on your embeddings for the answer. This could add some entropy to the system and give fun and unexpected results, especially in a philosophical debate bot!


A strategy I used to achieve this was to submit each chunk of my text in a prompt completion which simply prompts: “Please generate (x) questions that this text answers.” Then include the questions in the chunks to be embedded. It’s an extra step, and consumes more tokens, but it is an automated approach to improving the SNR.
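A sketch of this augmentation step, assuming a hypothetical helper for the prompt wording; the actual question generation would be a chat-completion call, which is stubbed out with hard-coded questions here:

```python
def build_question_prompt(chunk: str, n: int = 3) -> str:
    # Hypothetical prompt wording, following the approach described above.
    return f"Please generate {n} questions that this text answers:\n\n{chunk}"

def augment_chunk(chunk: str, questions: list[str]) -> str:
    """Prepend the generated questions so the chunk's embedding
    correlates with question-shaped user queries."""
    return "\n".join(questions) + "\n" + chunk

# In practice `questions` would come from a chat-completion call using
# build_question_prompt(chunk); here they are hard-coded for illustration.
chunk = "The sky appears blue because of Rayleigh scattering."
questions = ["What color is the sky?", "Why is the sky blue?"]
print(augment_chunk(chunk, questions))
```

The augmented string (questions plus original text) is what gets embedded; the original chunk is still what you feed to the prompt at answer time.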


That’s a great idea! Basically cast a wider net on the input before correlation.


I’m just wondering, given the example

Would it be possible to increase the SNR by embedding the answers only? In this case something like “The sky is blue.”, to increase the correlation between “sky” and “blue”?


Yes, that could work, though likely at a reduced SNR. This is equivalent to my big-corpus-of-unstructured-text case.

I’m guessing it will reduce the SNR somewhat, but it’s totally viable!


Interesting. What I’m thinking is that you can use GPT to increase the SNR and correlation before embedding. If I use the prompt “Generate very semantically compressed factual statements based on the following:” on the history section of the machine learning wiki, I get the following response:

Arthur Samuel coined “machine learning” in 1959, and by the 1960s, Raytheon Company developed CyberTron, an experimental learning machine. Interest in pattern recognition grew in the 1970s and 1980s. Tom M. Mitchell’s definition of machine learning emphasizes learning from experience and improving performance, aligning with Turing’s question, “Can machines do what we can do?”…

Compared to the original wiki text, it seems to capture “the important stuff”.


You could also do all the above.

(1) Embed the original incoming text, (2) embed the GPT expansion. Correlate both and gather a list of top hits for each. So List1 = correlation on the original, List2 = correlation on the GPT expansion. Merge these two lists together, and take the embeddings with the top correlations in the merged list.

This way you are giving the original input a shot at achieving the highest correlation, and using the GPT expansion as a backup.
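The merge step might look like this sketch, using toy 3-d vectors in place of real embeddings (the document IDs, vectors, and function names are all illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

db = {  # doc id -> toy embedding vector
    "sky_color": [0.9, 0.1, 0.0],
    "flat_earth": [0.1, 0.9, 0.0],
    "waldo": [0.0, 0.1, 0.9],
}

def top_hits(query_vec, k=2):
    """Rank database entries by correlation with the query."""
    return sorted(((cosine(query_vec, v), doc) for doc, v in db.items()),
                  reverse=True)[:k]

def merged_hits(orig_vec, expanded_vec, k=2):
    """Merge List1 (original input) and List2 (GPT expansion),
    keeping each doc's best correlation, then take the overall top k."""
    best = {}
    for score, doc in top_hits(orig_vec, k) + top_hits(expanded_vec, k):
        best[doc] = max(best.get(doc, 0.0), score)
    return sorted(((s, d) for d, s in best.items()), reverse=True)[:k]

orig = [0.8, 0.2, 0.1]      # embedding of the raw user input
expanded = [0.2, 0.2, 0.8]  # embedding of the GPT-expanded input
print(merged_hits(orig, expanded))
```

In this toy case the merged list surfaces the original input’s best match first and the expansion’s best match second — the expansion acting as a backup, exactly as described above.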


That seems like a really good idea :laughing:


To me it looks like a bad idea:

If a user’s question correlates with a question in our embeddings, then the AI would ignore the answer part and therefore base its answer only on the base model’s knowledge, ignoring that we have an answer.

Isn’t it so?

Remember the embeddings all correlate and map back to YOUR DATA!

So all this is trying to do is smooth out the interface between <Random Question> and <Company Approved Answer>.

A big problem is this lack of correlation between the question (or input) and your data. And you don’t want a “weak question” to correlate with off-topic data either. You can also tell GPT to respond with “I don’t know” if the answer doesn’t lie within the data pulled from the embedding correlations.
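One way to implement that guard is a similarity floor on the best hit before it ever reaches the prompt. A minimal sketch — the threshold value and names are assumptions to tune on your own data:

```python
SIMILARITY_FLOOR = 0.8  # assumed cutoff; tune against your own embeddings

def build_context(best_score: float, best_qa: str) -> str:
    """Only feed the matched Q/A pair into the prompt when the
    correlation is strong; otherwise instruct GPT to decline."""
    if best_score < SIMILARITY_FLOOR:
        return ("If the answer is not in the context, reply 'I don't know.'\n\n"
                "Context: (none)")
    return f"Use the following context to answer the question:\n\n{best_qa}"

print(build_context(0.91, "Q: What color is the sky?\nA: Blue"))
print(build_context(0.42, "Q: Is the earth flat?\nA: No"))
```

A weak question then never drags off-topic data into the prompt, and the “I don’t know” instruction covers the remaining gap.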

Not if you feed the answer back into the prompt!

Example input: “What color sky!”

Your database:
“What color is the sky?”
“Is the earth flat?”
“Where is Waldo?”

The correlation locks onto the first one. So you look this up, and also the answer in your database (in another field), and put this in the prompt:

Use the following context to answer the question:

Q: What color is the sky?
A: Blue

Q: What color sky!


The Q/A pair was from your embedded data, and the original question is fed back in and GPT will reference your data to answer it!
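That lookup-and-feed-back loop can be sketched end to end, again with toy 2-d vectors standing in for real embeddings (field and function names are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Only the questions are embedded; the answers live in another field.
database = [
    {"q": "What color is the sky?", "a": "Blue", "vec": [1.0, 0.0]},
    {"q": "Is the earth flat?",     "a": "No",   "vec": [0.0, 1.0]},
]

def answer_prompt(user_question: str, user_vec: list[float]) -> str:
    """Find the stored question that best correlates with the input,
    then feed its Q/A pair plus the original question to GPT."""
    best = max(database, key=lambda row: cosine(user_vec, row["vec"]))
    return ("Use the following context to answer the question:\n\n"
            f"Q: {best['q']}\nA: {best['a']}\n\n"
            f"Q: {user_question}")

# "What color sky!" embeds close to the stored sky question.
print(answer_prompt("What color sky!", [0.9, 0.1]))
```

The resulting string is the prompt shown above: your stored Q/A pair as context, with the user’s original question appended for GPT to answer from it.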


I have the same problem!
I want it to deploy some human intelligence, such as being taught by someone and gaining some basic knowledge.
But I don’t know how to classify that knowledge, as I don’t know how my own knowledge is deployed in real life either.
I want it to do some basic things when it analyzes on its own, not need instructions every time. I’m exhausted by that, even though something like letting it read some books and then testing it with real scenarios seems like a good idea. But it is actually hard.
Does anyone have any ideas? Or papers? Or thoughts? Resources? Any resource would be appreciated!

Embeddings maybe?

embeddings = knowledge