Building my first RAG system

I’m currently building a RAG system from scratch (not using OpenAI embeddings/vectors), but much smaller, with only 10–20 documents. I’m still new at all this, vibe coding and using LLMs to help with building it, but there’s something in my workflow I’d like to share (please correct me if it’s wrong):

For any RAG-based system, especially with large corpora (50k+ documents), it’s essential to perform top-K cosine similarity chunk selection when retrieving context.

That means (a rough code sketch follows this list):

  • Embedding the user query into a vector
  • Calculating cosine similarity between that query vector and every chunked document vector
  • Ranking the results by similarity
  • Injecting only the top-K most relevant chunks into the prompt context sent to the LLM
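
Here’s a minimal sketch of that selection step, assuming a local sentence-transformers model (“all-MiniLM-L6-v2” is just a placeholder, any embedding model works) and chunks already held in memory; in a real pipeline the chunk vectors would be precomputed and stored rather than re-embedded per query:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Embed the query and the chunks (chunk vectors would normally be precomputed).
    query_vec = model.encode([query])[0]
    chunk_vecs = model.encode(chunks)

    # Cosine similarity = dot product of L2-normalized vectors.
    query_vec = query_vec / np.linalg.norm(query_vec)
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = chunk_vecs @ query_vec

    # Rank by similarity and keep only the K most relevant chunks for the prompt.
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```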

This step ensures:

  • You stay within the model’s context window
  • The LLM sees only the most relevant content
  • You reduce hallucination risk and improve grounding
  • You don’t overload the model with loosely related or redundant material

Example: GPT-4o with a 128k Token Limit

If you’re using GPT-4o or GPT-4-turbo (with a 128,000-token context window), and you reserve:

  • ~4,000 tokens for your system prompt and user query
  • ~8,000 tokens for the model’s generated answer

You’re left with ~116,000 tokens for injecting context chunks.
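
In code, the budget arithmetic from those numbers looks like this (the reserve sizes are just the assumptions above):

```python
# Back-of-the-envelope context budget using the numbers above.
context_window = 128_000   # GPT-4o / GPT-4-turbo
prompt_reserve = 4_000     # system prompt + user query
answer_reserve = 8_000     # room for the generated answer

chunk_budget = context_window - prompt_reserve - answer_reserve  # 116,000 tokens

for chunk_size in (1_000, 2_000, 4_000):
    print(f"~{chunk_size}-token chunks: ~{chunk_budget // chunk_size} fit")
```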

Here’s what that allows:

| Avg Chunk Size | # of Chunks You Can Fit | Notes |
| --- | --- | --- |
| ~1,000 tokens | ~116 chunks | Small chunks, more coverage |
| ~2,000 tokens | ~58 chunks | Medium balance |
| ~4,000 tokens | ~29 chunks | Fewer, but longer |

These chunks can come from any number of documents; what’s important is that only the most relevant ones are selected and injected, based on similarity to the user’s query.

This is a critical (but often unstated) operational step in any production-grade RAG system. Without it, systems tend to either:

  • Exceed context limits,
  • Waste space with irrelevant info,
  • Or hallucinate due to lack of clear grounding.

One more thing: my workflow starts with “mindful chunking”. I created a custom Python pipeline with regex-based parsing and structural awareness to chunk documents along meaningful boundaries, like section headers, legal clauses, and bullet points, ensuring each chunk preserves semantic coherence for accurate embedding and retrieval.

I’m working under the assumption that how we create chunks in the first place is important. Naively splitting every 1,000 words or tokens, without regard for sentence boundaries, section headers, or semantic coherence, will result in lower-quality embeddings and poorer retrieval performance. A well-chunked document respects the internal structure of the text and may use techniques like sentence-windowing, overlap, or hierarchical metadata to preserve context. Good chunking is not just about size; it’s about meaningful boundaries, which directly influence relevance scoring and response accuracy downstream.
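
To make the idea concrete, here is a simplified sketch of one way to do it; the boundary regex here is illustrative and far cruder than a real structural parser, but it shows the two ingredients: split on structural boundaries first, then apply a sliding window with overlap to oversized sections.

```python
import re

# Illustrative boundary pattern: markdown-style headers ("## Title") and
# numbered clauses ("3.", "4.2.") -- a real structural parser needs more rules.
BOUNDARY = re.compile(r"(?=^#{1,6}\s)|(?=^\d+(?:\.\d+)*\.\s)", re.MULTILINE)

def chunk_document(text: str, max_words: int = 800, overlap: int = 100) -> list[str]:
    sections = [s.strip() for s in BOUNDARY.split(text) if s.strip()]
    chunks = []
    for section in sections:
        words = section.split()
        if len(words) <= max_words:
            chunks.append(section)
            continue
        # Sliding window with overlap so oversized sections keep local context.
        step = max_words - overlap
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```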

Hi @lucmachine, great exercise. Reading the above, I would start by selecting simple yet powerful tools to keep me on the rails:

  1. Database / API wrapper around the data / GUI workflows tool / MCP server: Directus
  2. Vector store / Vector management / API: Weaviate
  3. Any decent coding editor: your choice
  4. Copilot / Coding assistant: your choice

Thanks.

I suppose I should have mentioned that the final objective is to create a chatbot.

My tentative workflow is as follows (noting that this is for an MVP, testing a use case at low cost):

| Layer | Technology | Purpose | Implementation Details |
| --- | --- | --- | --- |
| Document Preprocessing | Python + Regex + python-docx | Semantic chunking | Custom logic for headings, clauses, structure preservation; run locally, upload to Supabase |
| Embedding Generation | Hugging Face API from Edge Function | Vector generation | Deno-based edge function calling the HF API for embeddings |
| Vector Storage | Supabase pgvector | Semantic search database | Native PostgreSQL with vector similarity (same database) |
| Query Vectorization | Hugging Face API | Query-to-vector conversion | Same API call from the edge function |
| Chunk Retrieval | Native SQL in Edge Function | Top-K selection | Direct SQL queries to pgvector within the same Supabase instance (sketched below) |
| LLM Completion | OpenAI/Claude/Gemini API | Response generation | API calls from Supabase edge functions |
| Frontend | React/Next.js or Wix | Chat interface | Any frontend calling Supabase edge function endpoints |
| Backend | Supabase Edge Functions | Integrated serverless | TypeScript/Deno functions with direct database access |
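
For the “Chunk Retrieval” row, the plan is a plain SQL query against pgvector from the edge function (in TypeScript/Deno); here is the same idea sketched in Python with psycopg2 so it’s easy to test locally. Table and column names (`chunks`, `embedding`, `content`) are placeholders; pgvector’s `<=>` operator is cosine distance, so ordering ascending returns the most similar chunks first.

```python
import psycopg2  # any Postgres client works; the edge function would do this in TypeScript

def retrieve_top_k(conn, query_vec: list[float], k: int = 5) -> list[str]:
    # Placeholder schema: a "chunks" table with "content" text and "embedding" vector columns.
    sql = """
        SELECT content
        FROM chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # pgvector expects a literal like '[0.1,0.2,...]' for the query vector.
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(sql, (vec_literal, k))
        return [row[0] for row in cur.fetchall()]
```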

Nb: Using Wix because I already have an account. If I scale up, I’ll find better backend and frontend solutions.

Anything glaringly off?
