@RonaldGRuckus
I tried so many methods to perfect my retrieval algorithm but I never tried 0% overlap. Will definitely give it a shot. Although I am indeed trying to automate the embedding process for technical API documentation retrieval, I do have complete control over the API references I provide as docs.
You don’t have to be bound by “rules”.
You can embed the chunks split at divisions, but then the vector database can provide text that goes beyond the embedded boundary.
You can promote chunks within a document by adjacency, and decide when on-topic context should be built by concatenating neighboring chunks.
You can rebuild documents out of the relevant chunks in their original order (not ranked by score), and give the document a pregenerated AI summary.
You can use your imagination.
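To make the adjacency and rebuild ideas concrete, here is a minimal sketch, not a definitive implementation. It assumes your vector search returns hits as `(doc_id, index)` pairs and that every chunk is kept in a plain dict keyed the same way. The names `Chunk`, `expand_with_neighbors`, `rebuild_documents` and the `window` parameter are made up for illustration, and the summaries are assumed to be generated ahead of time by whatever model you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int  # position of the chunk inside its source document
    text: str


def expand_with_neighbors(hits, store, window=1):
    """Add up to `window` neighboring chunks on each side of every hit.

    hits:  iterable of (doc_id, index) pairs from the vector search, any order.
    store: dict mapping (doc_id, index) -> Chunk (hypothetical storage layout).
    Returns {doc_id: sorted list of chunk indices to include}.
    """
    selected = {}
    for doc_id, idx in hits:
        for i in range(idx - window, idx + window + 1):
            if (doc_id, i) in store:  # skip positions outside the document
                selected.setdefault(doc_id, set()).add(i)
    return {doc_id: sorted(indices) for doc_id, indices in selected.items()}


def rebuild_documents(selected, store, summaries=None):
    """Concatenate the chosen chunks in document order, not in rank order.

    summaries: optional {doc_id: pregenerated summary text}.
    Gaps between non-adjacent chunks are marked with "[...]".
    """
    summaries = summaries or {}
    rebuilt = {}
    for doc_id, indices in selected.items():
        parts = []
        if doc_id in summaries:
            parts.append("Summary: " + summaries[doc_id])
        previous = None
        for i in indices:
            if previous is not None and i != previous + 1:
                parts.append("[...]")
            parts.append(store[(doc_id, i)].text)
            previous = i
        rebuilt[doc_id] = "\n".join(parts)
    return rebuilt
```

With `window=0` you get back only the embedded chunks, so the same code lets you tune how far past the embedded boundary the returned text goes.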
I agree with the no-overlap part. I initially used overlap and it was causing issues, so I moved to 300-token chunks (no overlap) followed by parent retrieval. The problem is that I still have context overlap among the chunks.
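For anyone following along, the setup described above might look roughly like the sketch below. It is an assumption-heavy illustration, not my actual pipeline: whitespace-separated words stand in for real tokens (swap in your tokenizer), and `split_no_overlap`, `build_index` and `parents_for_hits` are hypothetical names. Each parent is returned only once even when several of its child chunks match, which removes duplicated parents but does not fix overlap in meaning between different chunks.

```python
def split_no_overlap(tokens, size=300):
    """Cut a token list into consecutive pieces of at most `size` tokens, 0% overlap."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def build_index(parents, size=300):
    """Build the child chunks that would be embedded.

    parents: {parent_id: full text of the parent section}.
    Returns a list of (parent_id, child_text); the list position serves as the child id.
    Whitespace splitting is a stand-in for a real tokenizer (assumption).
    """
    children = []
    for parent_id, text in parents.items():
        for piece in split_no_overlap(text.split(), size):
            children.append((parent_id, " ".join(piece)))
    return children


def parents_for_hits(hit_positions, children, parents):
    """Map ranked child hits back to their parents, keeping each parent once.

    hit_positions: positions into `children`, best match first.
    Returns parent texts in order of their best-ranked child.
    """
    seen = set()
    result = []
    for position in hit_positions:
        parent_id = children[position][0]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result
```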
No overlap required in this approach. Just atomic ideas: Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop
Also, check out the @sergeliatko approach: Using gpt-4 API to Semantically Chunk Documents - #10 by sergeliatko
Great words. Even the sky is not the limit. Personally, I try to keep the info accessible via RAG as close as possible to what the business logic requires.