I’ve been thinking of embedding a bunch of books, mostly non-technical advice or philosophy books. This would be a side project, not work-related.
Based on the conversations in this thread and others, here is my embedding strategy that I’ve come up with:
- Embed every 3 paragraphs, sliding one paragraph at a time (so ~66% overlap). Or maybe make all chunks disjoint to make things easier later.
- Working assumption: each idea is contained in at most 3 paragraphs (~500 tokens)
- Each embedding has metadata on starting paragraph number, ending paragraph number (used later to de-overlap and coherentize)
- Could also contain metadata on Chapter / Author / Page, etc., but really need TITLE so as not to mix books if I need to coherently stick adjacent chunks together. If I go with disjoint non-overlapping chunks, this doesn’t matter so much.
- I would not mix the metadata in the embedding, have it as separate data and retrieve it for the prompt to examine if necessary, because of the thought:
- Don’t contaminate the embedding with the metadata; only embed ideas and content, and keep the metadata separate in the DB. I don’t plan on querying on the author/title, and that’s the main reason for me. It’s fun to see what pops up, and the metadata will still be available in the prompt, since I can return it alongside each retrieved chunk, but it won’t be directly embedded.
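Here is a rough sketch of the chunking plan above in Python, just to make the idea concrete. The names (`Chunk`, `chunk_paragraphs`, `merge_hits`) are made up for illustration, and it assumes each book is already split into a list of paragraph strings. Only the `text` field would be sent to the embedding model; title and paragraph numbers stay as separate metadata.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str  # book title, kept so chunks from different books never get merged
    start: int  # index of the first paragraph in the chunk (inclusive)
    end: int    # index of the last paragraph in the chunk (inclusive)
    text: str   # the only field that would be embedded


def chunk_paragraphs(title, paragraphs, window=3, stride=1):
    """Slide a window of `window` paragraphs forward `stride` paragraphs at a time.

    window=3, stride=1 gives ~66% overlap; stride == window gives disjoint chunks.
    """
    chunks = []
    n = len(paragraphs)
    for start in range(0, n, stride):
        end = min(start + window, n) - 1
        chunks.append(Chunk(title, start, end, "\n\n".join(paragraphs[start:end + 1])))
        if end == n - 1:  # the last paragraph is covered, so stop
            break
    return chunks


def merge_hits(hits):
    """De-overlap retrieved chunks: collapse overlapping or adjacent chunks
    from the same book into (title, start, end) paragraph spans."""
    spans = []
    for c in sorted(hits, key=lambda c: (c.title, c.start)):
        if spans and spans[-1][0] == c.title and c.start <= spans[-1][2] + 1:
            title, start, end = spans[-1]
            spans[-1] = (title, start, max(end, c.end))
        else:
            spans.append((c.title, c.start, c.end))
    return spans
```

After retrieval, the spans from `merge_hits` can be used to pull the original paragraphs back out of storage and stitch them into one coherent passage per book for the prompt.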
So here’s my next thought: since GPT-4 has a minimum of 8k context, I was wondering if I should embed more at once, maybe 6 paragraphs at 33% overlap?
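To compare the two options, here is a back-of-the-envelope sketch. It assumes roughly 167 tokens per paragraph (from the ~500 tokens per 3 paragraphs estimate above), an 8,192-token context, and 2,000 tokens reserved for the question, instructions, and answer. That reserve is a guess on my part, and the function names are just for illustration.

```python
def overlap_fraction(window, stride):
    """Fraction of each chunk shared with the next one."""
    return (window - stride) / window


def chunks_that_fit(context_tokens, tokens_per_chunk, reserved_tokens):
    """How many retrieved chunks fit in the prompt after reserving room
    for the question, instructions, and the model's answer."""
    return max((context_tokens - reserved_tokens) // tokens_per_chunk, 0)


TOKENS_PER_PARAGRAPH = 500 // 3  # ~166, derived from ~500 tokens per 3 paragraphs
CONTEXT = 8192                   # assumed 8k context size
RESERVED = 2000                  # assumed budget for prompt overhead and answer

for window, stride in [(3, 1), (6, 4)]:
    tokens = window * TOKENS_PER_PARAGRAPH
    print(
        f"window={window} stride={stride} "
        f"overlap={overlap_fraction(window, stride):.0%} "
        f"~{tokens} tokens/chunk, "
        f"{chunks_that_fit(CONTEXT, tokens, RESERVED)} chunks fit"
    )
```

Under these assumptions, a 6-paragraph window with a stride of 4 gives the 33% overlap, and each chunk is about twice as large, so about half as many chunks fit in the prompt.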
It’s going to be trial and error.
Then I am going to hook this up to my personal assistant SMS network that I’ve built, so I can use it anywhere in the world from my cell phone.