The length of the embedding contents

I’ve been thinking of embedding a bunch of books, mostly non-technical advice books, or philosophy books. This would be a side-project, and not work related for me.

Based on the conversations in this thread and others, here is my embedding strategy that I’ve come up with:

  • Embed every 3 paragraphs, slide one paragraph at a time (so ~66% overlap) Or maybe make all chunks disjoint to make things easier later.
  • Each idea is contained in at most 3 paragraphs (~500 tokens)
  • Each embedding has metadata on starting paragraph number, ending paragraph number (used later to de-overlap and coherentize)
  • Could also contain metadata on Chapter / Author / Page, etc., but really need TITLE so as not to mix books if I need to coherently stick adjacent chunks together. If I go with disjoint non-overlapping chunks, this doesn’t matter so much.
  • I would not mix the metadata in the embedding, have it as separate data and retrieve it for the prompt to examine if necessary, because of the thought:
  • Don’t contaminate the embedding with the metadata, only embed ideas and content, keep metadata separate in the DB.. I don’t plan on querying on the author/title, that’s the main reason for me. It fun to see what pops up, and the metadata will be available in prompt, since I can return the adjacent metadata, but it won’t be directly embedded.

So here’s my next thought, since GPT-4 has at a minimum 8k context. I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?

It’s going to be trial and error.

Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.

8 Likes