I’m currently building a RAG system from scratch (not using OpenAI embeddings/vectors), but much smaller, with only 10-20 documents. I’m still new at all this, vibe coding and using LLMs to help with building it, but there’s something in my workflow I’d like to share (please correct me if it’s wrong):
For any RAG-based system, especially with large corpora (50k+ documents), it’s essential to perform top-K cosine similarity chunk selection when retrieving context.
That means:
- Embedding the user query into a vector
- Calculating cosine similarity between that query vector and every chunked document vector
- Ranking the results by similarity
- Injecting only the top-K most relevant chunks into the prompt context sent to the LLM
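Here’s a minimal sketch of that retrieval step. It assumes a sentence-transformers encoder (the model name below is just a placeholder; swap in whatever embedding model you actually use) and plain NumPy for the cosine math:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # placeholder encoder; any embedding model works

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_chunks(query: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most cosine-similar to the query."""
    query_vec = model.encode(query)
    # Cosine similarity = dot product of L2-normalized vectors
    query_vec = query_vec / np.linalg.norm(query_vec)
    normed = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    scores = normed @ query_vec
    top_idx = np.argsort(scores)[::-1][:k]  # highest similarity first
    return [chunks[i] for i in top_idx]

# chunk_vectors is computed once, up front, e.g.:
# chunk_vectors = model.encode(chunks)  # shape: (num_chunks, embedding_dim)
```

The returned chunks are what get concatenated into the prompt context.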
This step ensures:
- You stay within the model’s context window
- The LLM sees only the most relevant content
- You reduce hallucination risk and improve grounding
- You don’t overload the model with loosely related or redundant material
Example: GPT-4o with a 128k Token Limit
If you’re using GPT-4o or GPT-4-turbo (with a 128,000-token context window), and you reserve:
- ~4,000 tokens for your system prompt and user query
- ~8,000 tokens for the model’s generated answer
You’re left with ~116,000 tokens for injecting context chunks.
Here’s what that allows:
| Avg Chunk Size | # of Chunks You Can Fit | Notes |
|---|---|---|
| ~1,000 tokens | ~116 chunks | Small chunks, more coverage |
| ~2,000 tokens | ~58 chunks | Medium balance |
| ~4,000 tokens | ~29 chunks | Fewer, but longer |
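In code, enforcing that budget is just greedy packing of the ranked chunks until the remaining context runs out. A rough sketch (I’m using tiktoken’s cl100k_base encoding as an approximation here; the exact tokenizer depends on your model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; use your model's actual encoding

def pack_chunks(ranked_chunks: list[str], budget: int = 116_000) -> list[str]:
    """Greedily keep the highest-ranked chunks that still fit in the token budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # stop at the first chunk that doesn't fit (or `continue` to keep trying smaller ones)
        selected.append(chunk)
        used += n
    return selected
```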
These chunks can come from any number of documents; what’s important is that only the most relevant ones are selected and injected, based on similarity to the user’s query.
This is a critical (but often unstated) operational step in any production-grade RAG system. Without it, systems tend to:
- Exceed context limits,
- Waste space with irrelevant info,
- Or hallucinate due to lack of clear grounding.
One more thing: my workflow starts with “mindful chunking”. I created a custom Python pipeline with regex-based parsing and structural awareness to chunk documents along meaningful boundaries, like section headers, legal clauses, and bullet points, ensuring each chunk preserves semantic coherence for accurate embedding and retrieval.
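I won’t paste the whole pipeline, but the core idea of splitting on structural boundaries looks roughly like this (the regex patterns below are simplified placeholders; real documents need their own):

```python
import re

# Simplified placeholders: split at lines that start a markdown header,
# a numbered clause (e.g. "3." or "3.2"), or "Section N".
BOUNDARY = re.compile(r"(?mi)^(?=#{1,6}\s|\d+(?:\.\d+)*\.?\s|Section\s+\d+)")

def chunk_by_structure(text: str, max_chars: int = 4000) -> list[str]:
    """Split text at structural boundaries, then merge small pieces up to a size cap."""
    pieces = [p.strip() for p in BOUNDARY.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one at this boundary
            current = piece
        else:
            current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```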
I’m working under the assumption that how we create chunks in the first place is important. Naively splitting every 1,000 words or tokens, without regard for sentence boundaries, section headers, or semantic coherence, will result in lower-quality embeddings and poorer retrieval performance. A well-chunked document respects the internal structure of the text and may use techniques like sentence-windowing, overlap, or hierarchical metadata to preserve context. Good chunking is not just about size; it’s about meaningful boundaries, which directly influence relevance scoring and response accuracy downstream.
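For the overlap / sentence-windowing part, here’s a minimal sketch (the sentence split is naive; a proper sentence tokenizer like nltk or spaCy would do better):

```python
import re

def sentence_window_chunks(text: str, window: int = 8, overlap: int = 2) -> list[str]:
    """Group sentences into windows of `window` sentences, overlapping by `overlap`."""
    # Naive split on ., !, ? followed by whitespace; swap in a real sentence tokenizer for production.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max(window - overlap, 1)
    return [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), step)]
```

The overlap means a sentence near a window edge still shows up with its neighbors in the adjacent chunk, so retrieval doesn’t lose the surrounding context.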