Right way to calculate embeddings for a topic

There is a website whose content all covers the same general subject. Imagine the whole website describes different aspects of a single watch model, like the “Rolex Submariner”, where each URL is an article about a sub-topic.

What is the proper way to calculate embeddings for the site content?

  1. Would you calculate an embedding for every single URL?
  2. Or would you combine the content of all URLs into a single file, structured into sections with something like Markdown, and then calculate embeddings for the whole content package?

My feeling says the second way should be better, but it is just a feeling.
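
For context, here is a minimal sketch of what option 1 would look like if you called the OpenAI embeddings API yourself. The model name and the example page texts are placeholder assumptions, not something from this thread:

```python
# Sketch of option 1: one embedding per URL.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
# The model choice and the pages dict are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

pages = {
    "https://example.com/history": "Article text about the model's history...",
    "https://example.com/movement": "Article text about the movement...",
}

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=list(pages.values()),  # one input string per URL
)

# One embedding vector per URL, in input order.
embeddings = {
    url: item.embedding for url, item in zip(pages.keys(), response.data)
}
```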

What is the expected use for the embedding?

More specifically: are you expecting users of the website to ask questions about the individual aspects of the watch model? Does the website just cover one watch model or multiple watch models?

The expected use is as a knowledge file for a custom GPT, so it can answer questions about the whole topic (all aspects).

In that case I would go with option 2. Assuming you are using the knowledge file capability in the custom GPT, you really just have to upload the file and then make sure to reference the knowledge base in your instructions. The chunking of your document and its conversion into embeddings are all handled in the background.
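
To make that background work a bit more concrete, here is a rough sketch of what such a pipeline typically does: split the combined document into overlapping chunks, then embed each chunk. The chunk size, overlap, filename, and helper function are illustrative assumptions, not the actual GPT knowledge implementation:

```python
# Illustrative sketch of chunk-then-embed, roughly what a retrieval
# pipeline does behind the scenes. Chunk size and overlap are
# arbitrary assumptions chosen for illustration only.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# "watch_knowledge.md" stands in for your combined option-2 file.
document = open("watch_knowledge.md", encoding="utf-8").read()
chunks = chunk_text(document)

response = client.embeddings.create(
    model="text-embedding-3-small",  # placeholder model choice
    input=chunks,
)
chunk_embeddings = [item.embedding for item in response.data]
```

With a custom GPT knowledge file you do not need to run any of this yourself; it is only meant to show why a single well-structured file works fine as input.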

You can read up a bit more on the knowledge capability for GPTs, including best practices, at the following two links:

https://help.openai.com/en/articles/8843948-knowledge-in-gpts

https://help.openai.com/en/articles/8868588-retrieval-augmented-generation-rag-and-semantic-search-for-gpts
