The length of the embedding contents

curt.kennedy · March 23, 2023, 3:48pm

I’ve been thinking of embedding a bunch of books, mostly non-technical advice or philosophy books. This would be a side project, not work-related for me.

Based on the conversations in this thread and others, here is the embedding strategy I’ve come up with:

Embed every 3 paragraphs, sliding one paragraph at a time (so ~66% overlap). Or maybe make all chunks disjoint to make things easier later.
Each idea is contained in at most 3 paragraphs (~500 tokens).
Each embedding has metadata for its starting and ending paragraph numbers (used later to de-overlap the chunks and stitch them back into coherent text).
It could also carry metadata on chapter, author, page, etc., but it really needs the TITLE so that I don’t mix books when I have to stitch adjacent chunks together. If I go with disjoint, non-overlapping chunks, this doesn’t matter so much.
I would not mix the metadata into the embedding. I’d keep it as separate data and retrieve it for the prompt to examine if necessary, because of this thought:
Don’t contaminate the embedding with the metadata; only embed ideas and content, and keep the metadata separate in the DB. I don’t plan on querying by author or title, and that’s the main reason for me. It’s fun to see what pops up, and the metadata will still be available in the prompt, since I can return the adjacent metadata, but it won’t be directly embedded.
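
The strategy above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the names `Chunk`, `chunk_paragraphs`, and `merge_ranges` are made up for this example, paragraphs are assumed to be already split into a list of strings, and no embedding API is called. Only `text` would be sent to the embedding model; the other fields would sit next to the vector in the DB.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str  # book title, stored as metadata so books are never mixed
    start: int  # index of the first paragraph in the chunk (inclusive)
    end: int    # index of the last paragraph in the chunk (inclusive)
    text: str   # the only field that would be embedded


def chunk_paragraphs(title, paragraphs, window=3, stride=1):
    """Slide a `window`-paragraph window forward `stride` paragraphs at a time.

    stride=1 with window=3 gives ~66% overlap; stride=window gives disjoint chunks.
    """
    if not 1 <= stride <= window:
        raise ValueError("stride must be between 1 and window, or paragraphs get skipped")
    chunks = []
    n = len(paragraphs)
    start = 0
    while start < n:
        stop = min(start + window, n)  # exclusive end of the slice
        chunks.append(Chunk(title, start, stop - 1, "\n\n".join(paragraphs[start:stop])))
        if stop >= n:  # the last paragraph is covered, so stop sliding
            break
        start += stride
    return chunks


def merge_ranges(hits):
    """De-overlap retrieved chunks: merge overlapping or adjacent paragraph
    ranges, but only within the same title."""
    merged = []
    for c in sorted(hits, key=lambda c: (c.title, c.start)):
        if merged and merged[-1][0] == c.title and c.start <= merged[-1][2] + 1:
            title, start, end = merged[-1]
            merged[-1] = (title, start, max(end, c.end))
        else:
            merged.append((c.title, c.start, c.end))
    return merged
```

After retrieval, `merge_ranges` returns `(title, start, end)` tuples, and the coherent passage for the prompt can be rebuilt by re-joining those paragraphs from the stored book text. With disjoint chunks, the merge step only ever joins neighbours and never has to remove duplicated paragraphs.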

So here’s my next thought: since GPT-4 has at a minimum an 8k context, I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?
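
For a rough feel of the trade-off, here is a small back-of-the-envelope sketch. The helper names are made up, the ~500 tokens per 3 paragraphs figure is the estimate from above, and the 1,500 tokens reserved for the question, instructions, and answer is purely an assumed number.

```python
def stride_for_overlap(window, overlap):
    """Paragraphs to advance per step so that consecutive chunks share
    roughly `overlap` (a fraction between 0 and 1) of the window."""
    shared = round(window * overlap)
    return max(window - shared, 1)


def chunks_that_fit(context_tokens, reserved_tokens, window, tokens_per_3_paragraphs=500):
    """How many whole retrieved chunks fit in the prompt, using the
    ~500 tokens per 3 paragraphs estimate (integer arithmetic only)."""
    budget = context_tokens - reserved_tokens
    return (budget * 3) // (window * tokens_per_3_paragraphs)


# 3 paragraphs at ~66% overlap: slide 1 paragraph at a time
print(stride_for_overlap(3, 2 / 3))
# 6 paragraphs at ~33% overlap: slide 4 paragraphs at a time
print(stride_for_overlap(6, 1 / 3))
# Retrieved chunks that fit in roughly 8k tokens with 1,500 reserved (assumed)
print(chunks_that_fit(8000, 1500, window=3), chunks_that_fit(8000, 1500, window=6))
```

Under these assumptions the total retrieved text that fits is about the same either way. What changes is granularity: many small, focused hits versus fewer, longer passages.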

It’s going to be trial and error.

Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.
