Building my first RAG system

I’m currently building a RAG system from scratch (not using OpenAI embeddings/vectors), but much smaller, with only 10-20 documents. I’m still new at all this, vibe coding and using LLMs to help with building it, but there’s something in my workflow I’d like to share (please correct me if it’s wrong):

For any RAG-based system, especially with large corpora (50k+ documents), it’s essential to perform top-K cosine similarity chunk selection when retrieving context.

That means:

  • Embedding the user query into a vector
  • Calculating cosine similarity between that query vector and every chunked document vector
  • Ranking the results by similarity
  • Injecting only the top-K most relevant chunks into the prompt context sent to the LLM

This step ensures:

  • You stay within the model’s context window
  • The LLM sees only the most relevant content
  • You reduce hallucination risk and improve grounding
  • You don’t overload the model with loosely related or redundant material
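
Putting the steps above together, here’s a minimal sketch of the retrieval step (the embedding model name and the chunks are placeholders, not my actual setup):

```python
# Minimal top-K cosine similarity retrieval sketch using sentence-transformers + NumPy.
# The model name and chunk texts below are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

chunks = ["chunk one text ...", "chunk two text ...", "chunk three text ..."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)

def top_k_chunks(query: str, k: int = 2) -> list[tuple[str, float]]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    sims = chunk_vecs @ query_vec  # cosine similarity (vectors are normalized)
    top_idx = np.argsort(sims)[::-1][:k]  # rank by similarity, keep top-K
    return [(chunks[i], float(sims[i])) for i in top_idx]

# The selected chunks would then be concatenated into the prompt context.
print(top_k_chunks("What does the standard require?"))
```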

Example: GPT-4o with a 128k Token Limit

If you’re using GPT-4o or GPT-4-turbo (with a 128,000-token context window), and you reserve:

  • ~4,000 tokens for your system prompt and user query
  • ~8,000 tokens for the model’s generated answer

You’re left with ~116,000 tokens for injecting context chunks.

Here’s what that allows:

| Avg Chunk Size | # of Chunks You Can Fit | Notes |
|---|---|---|
| ~1,000 tokens | ~116 chunks | Small chunks, more coverage |
| ~2,000 tokens | ~58 chunks | Medium balance |
| ~4,000 tokens | ~29 chunks | Fewer, but longer |
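
A quick sanity check of that arithmetic in Python:

```python
# Back-of-the-envelope context budget for the 128k example above.
CONTEXT_WINDOW = 128_000
SYSTEM_AND_QUERY = 4_000
ANSWER_RESERVE = 8_000

available = CONTEXT_WINDOW - SYSTEM_AND_QUERY - ANSWER_RESERVE  # 116,000 tokens
for chunk_size in (1_000, 2_000, 4_000):
    print(f"{chunk_size} tokens/chunk -> ~{available // chunk_size} chunks")
```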

These chunks can come from any number of documents; what’s important is that only the most relevant ones are selected and injected, based on similarity to the user’s query.

This is a critical (but often unstated) operational step in any production-grade RAG system. Without it, systems tend to either:

  • Exceed context limits,
  • Waste space with irrelevant info,
  • Or hallucinate due to lack of clear grounding.

One more thing: my workflow starts with “mindful chunking”. I created a custom Python pipeline with regex-based parsing and structural awareness to chunk documents along meaningful boundaries, like section headers, legal clauses, and bullet points, ensuring each chunk preserves semantic coherence for accurate embedding and retrieval.

I’m working under the assumption that how we create chunks in the first place is important. Naively splitting every 1,000 words or tokens, without regard for sentence boundaries, section headers, or semantic coherence, will result in lower-quality embeddings and poorer retrieval performance. A well-chunked document respects the internal structure of the text and may use techniques like sentence-windowing, overlap, or hierarchical metadata to preserve context. Good chunking is not just about size; it’s about meaningful boundaries, which directly influence relevance scoring and response accuracy downstream.
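
To make that concrete, here’s a rough sketch of the kind of structure-aware splitting I mean (the header regex is simplified for illustration and is not my actual pipeline):

```python
# Toy "mindful chunking" sketch: split on section-like headers instead of a
# fixed token count, falling back to paragraphs for oversized sections.
# The header regex is illustrative only.
import re

SECTION_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S", re.MULTILINE)  # e.g. "3.2 Reporting"

def chunk_by_structure(text: str, max_chars: int = 4000) -> list[str]:
    starts = [m.start() for m in SECTION_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

    chunks: list[str] = []
    for sec in sections:
        if not sec:
            continue
        if len(sec) <= max_chars:
            chunks.append(sec)
        else:
            # Oversized section: fall back to paragraph boundaries.
            chunks.extend(p.strip() for p in sec.split("\n\n") if p.strip())
    return chunks
```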

Hi @lucmachine , great exercise. Reading the above, I would start by selecting simple yet powerful tools to keep me on rails:

  1. Database / API wrapper around the data / GUI workflows tool / MCP server: Directus
  2. Vector store / vector management / API: Weaviate
  3. Any decent coding editor: your choice
  4. Copilot / Coding assistant: your choice

Thanks.

I suppose I should have mentioned that the final objective is to create a chatbot.

My tentative workflow is as follows (noting that this is for an MVP, testing a use case at low cost):

| Layer | Technology | Purpose | Implementation Details |
|---|---|---|---|
| Document Preprocessing | Python + Regex + python-docx | Semantic chunking | Custom logic for headings, clauses, structure preservation. Run locally, upload to Supabase |
| Embedding Generation | Hugging Face API from Edge Function | Vector generation | Deno-based edge function calling the HF API for embeddings |
| Vector Storage | Supabase pgvector | Semantic search database | Native PostgreSQL with vector similarity (same database) |
| Query Vectorization | Hugging Face API | Query-to-vector conversion | Same API call from the edge function |
| Chunk Retrieval | Native SQL in Edge Function | Top-K selection | Direct SQL queries to pgvector within the same Supabase instance |
| LLM Completion | OpenAI/Claude/Gemini API | Response generation | API calls from Supabase edge functions |
| Frontend | React/Next.js or WIX | Chat interface | Any frontend calling Supabase edge function endpoints |
| Backend | Supabase Edge Functions | Integrated serverless | TypeScript/Deno functions with direct database access |
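
For the Chunk Retrieval row, here’s a rough Python/psycopg2 sketch of the Top-K pgvector query (the table and column names are placeholders; in practice this would be a SQL call inside the edge function):

```python
# Hypothetical Top-K retrieval against Supabase pgvector via psycopg2.
# Table name (chunks) and columns (content, embedding) are placeholders.
from psycopg2 import connect

def retrieve_top_k(query_vec: list[float], k: int = 10):
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"  # pgvector text format
    conn = connect("postgresql://user:password@db.example.supabase.co:5432/postgres")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # "<=>" is pgvector's cosine distance operator; smaller distance = more similar.
        cur.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_similarity
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, vec_literal, k),
        )
        return cur.fetchall()
```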

NB: I’m using Wix because I already have an account. If I scale up, I’ll find better back-end and front-end solutions.

Anything glaringly off?

2 Likes

@lucmachine - If I were to start today to build a RAG KnowledgeBot POC, this is where I would start:

  1. Cursor (IDE) - Makes me a polyglot.
  2. UX - Lovable Landing Page to host the chatbot & Next.js (check out https://ai-sdk.dev/); Lovable integrates well with Supabase too, so it’s great for starter projects.
  3. Document Processing - Use an LLM as a document processor; with your extensive prompt engineering, you could also try AgentSDK (it’s really powerful).
  4. Vector DB - Anything that’s easy to inspect and connect to Supabase: pgvector.
  5. My model preferences may be biased, but keep everything within the OpenAI ecosystem for faster POCs.
    Cost to build: <= $100 (including Cursor + Lovable + OpenAI API).
    When your data sizes grow and data modalities differ (e.g., PDFs with images), you can look at the various chunking/embedding strategies: multimodal embeddings, hybrid search, etc.
2 Likes

I think this is great. I remember our original conversation where you posted this as well. Personally, I think it looks good.

What are your difficulties, and what stage are you at now? How is it going? What are you going to use to produce all the code for this, and what environment do you have set up?

The “LLM completion” is going to be the interesting part, in my opinion. I.e., let’s say you’ve made it through the “preprocessing, embedding generation, vector storage”, etc.

Then you prompt the LLM. How are you providing the LLM with the necessary ability to “navigate” the databases/vectors? Are you dumping everything, or are you actually pre-processing your own LLM prompt as a “query vector”, then calling the vector DB, returning the result to the LLM, and asking the LLM to provide further content?

I.e. with all that vectorization/etc., then it seems like the question is, what do you want the LLM to actually DO? You want it to turn your natural language-query into a proper query to the DB, and then interpret the results?

As a whole, I’m very interested in your system, because it’s close to my heart as something I’d like to achieve as well. I’ve spent all my time, however, “developing a chatbot-LLM interaction system” that will then be able to “create this system” (i.e., the one you are describing!). So I’m really interested in how you’re going to go about this. And maybe there’s some kind of cross-over collaboration potential… I still haven’t achieved my fully automated workflow, but if you’re comfortable with basic dev environments, I’d be open to letting you try out my system and install it on your machine; it’s a complete and fully functioning integrated code editor, really… and the results of your own project would be more than interesting to me for my own purposes (not related to legal documents or producing something saleable).

Let me know if you’re interested… I could message you a demonstration video… and see if you think my system would be useful to you in developing your own project.

3 Likes

The workflow and code are all LLM-produced. Having said that, I do have a soft programming background, so I know how to structure and ask the right questions.

I spent the last three months working out the RegEx strategy. I also wasted a month trying spaCy, only to realize it’s not the right tool. It might sound crazy, but the first step, cleaning source documents, is much more difficult than expected. All the Python tools (e.g., PyMuPDF, pdfplumber, etc.) are very basic and not precise enough for what I’m doing.

I’m in a specialized field (i.e., carbon markets with accounting and ESG frameworks), and the PDF guidance and standards are full of tables, nested bullet lists, etc. So I started out by only extracting sentences that were relevant (e.g., sentences that contained specific words like “shall”, “should”, “must”, etc.), because compliance matters for meeting regulations and passing verification audits.

I’m at a place where the RegEx is doing 90% of what’s expected.
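
To give a flavour of it, here’s a toy version of that compliance filter (the keyword list and sentence splitting are simplified; my actual script is much more involved):

```python
# Toy compliance-sentence filter: keep only sentences containing obligation keywords.
# Keyword list and sentence splitting are simplified for illustration.
import re

OBLIGATION_RE = re.compile(r"\b(shall|should|must|required to)\b", re.IGNORECASE)
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")

def extract_compliance_sentences(text: str) -> list[str]:
    sentences = SENTENCE_SPLIT_RE.split(text.replace("\n", " "))
    return [s.strip() for s in sentences if OBLIGATION_RE.search(s)]

sample = (
    "The operator shall report Scope 1 emissions annually. "
    "Background information is provided in Annex B. "
    "Records must be retained for at least five years."
)
print(extract_compliance_sentences(sample))
```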

Then I spent about a month understanding how to compare two sentences for semantic similarity using cosine similarity.

Check it out; this is: 1) the RegEx output, and 2) SBERT + cosine similarity [examining how two different ESG frameworks are similar at the compliance level].

All coded in Python, for example here is what’s needed for (2):

```python
import os
import sys
import time
import string
import logging
import itertools
import numpy as np
import pandas as pd
import torch
import nltk
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from psycopg2 import connect
from psycopg2.extras import execute_values
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util
from transformers import BertTokenizer, BertModel
```
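
And here’s a minimal example of the actual comparison step (the model name and sentences are placeholders):

```python
# Minimal SBERT + cosine similarity comparison for two sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

s1 = "The operator shall report Scope 1 emissions annually."
s2 = "Organizations must disclose direct GHG emissions every year."

emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
score = util.cos_sim(emb1, emb2).item()
print(f"Cosine similarity: {score:.3f}")
```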

Then the LLM showed me how to create the schema and get the data into Supabase.

I’m now creating vectors and uploading them. I might not need LLM-generated tags, because my RegEx script also searches for keywords (from a list) inside each sentence. Yes, these sentences will become “chunks”.

Thanks for the workflow. I’ll check it out, for sure!

@lucmachine Your tentative workflow looks solid, and I quite like the idea of ‘mindful chunking’! I use a similar approach at work, where I try to get familiar with the structure, content, similarity, etc. of the corpus I’m dealing with before thinking about the rest of the workflow. It’s helpful to involve the subject matter experts too; big gains can be achieved at this stage itself. This best practices section published by OpenAI is nice, and it talks about some of these strategies (it might be handy once you have a proof-of-concept app up and running and want to improve the results).

I will say that if you’re not intending to use more than a few documents, you could bypass the vectorization process and use an LLM-as-a-Retriever instead. There are a few different iterations out there (here’s one for reference), but the core idea is to pass the entire context of a single chunk in a single API call to a small/more affordable model for the sole purpose of deciding whether the chunk is relevant or not. You could scale this and run the entire process in parallel (n API calls in parallel for n chunks, depending on your specific usage tier), and since time to first token is small for the mini models, this process can be quite fast in practice. Once all relevant chunks have been identified, the ‘AG’ part of RAG stays the same really. You could also use the LLM-as-a-Retriever in combination with vectors as an intermediate step to have even more control over the context that gets passed downstream, and not worry about whether the RegEx was 100% accurate or not. A rough sketch of the per-chunk relevance check is below.
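
Something like this, assuming the OpenAI Python SDK (the model name and prompt are placeholders, not a prescription):

```python
# Rough sketch of an LLM-as-a-Retriever relevance check run in parallel.
# Model name and prompt wording are placeholders; adapt to your own setup.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_relevant(query: str, chunk: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder small/affordable model
        messages=[
            {"role": "system", "content": "Answer only YES or NO."},
            {"role": "user", "content": f"Question: {query}\n\nChunk:\n{chunk}\n\nIs this chunk relevant?"},
        ],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")

def retrieve(query: str, chunks: list[str]) -> list[str]:
    # One API call per chunk, run in parallel; keep only the chunks flagged relevant.
    with ThreadPoolExecutor(max_workers=8) as pool:
        flags = list(pool.map(lambda c: is_relevant(query, c), chunks))
    return [c for c, keep in zip(chunks, flags) if keep]
```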

Hope this helps. Looking forward to seeing how it turns out!

3 Likes

@jai

Here is what the LLM thinks of your recommendation. You tell me, is it correct?

I see the proposed ReAG workflow from the superagent-ai/reag repo, and I’ve fully reviewed your original GHG Protocol Expert Chatbot plan. Here’s a side-by-side synthesis and verdict:


:balance_scale: Comparison: Your Plan vs. ReAG

| Feature | Your Architecture | ReAG (Reasoning Augmented Generation) |
|---|---|---|
| Input Handling | Chunked document ingestion with semantic structure | Full document ingestion (no chunking) |
| Retrieval Approach | Vector similarity (via pgvector + embedding model) | Full document review by LLM at query time |
| LLM Input | Top-K chunks with semantic similarity | Whole documents (or large sections) fed directly |
| Response Strategy | RAG: context-restricted completion | Reasoning + holistic synthesis |
| Frontend | WIX + Velo or Next.js (your choice) | Not opinionated; you choose |
| Backend | Supabase Edge Functions or FastAPI | You can wrap ReAG logic in API calls |
| Performance | Fast (retrieves top-K, low-cost) | Slower, higher token usage |
| Cost | ~$0.02–$0.05/query (OpenAI GPT-4o) | Much higher: full document tokens sent to the LLM every time |
| Control | You define chunking, tagging, scoring | ReAG leaves more control to the LLM |
| Scalability | Strong, especially with indexed chunks | Weak: large context windows don’t scale well |
| Risk of Hallucination | Mitigated via retrieved citations | Higher without chunk filtering |
| Best For | Structured docs, accuracy, citations | Unstructured docs, deep synthesis, exploratory queries |

:brain: Core Trade-off

ReAG offers “human-like synthesis” by letting the LLM reason across whole documents. This can lead to deeper answers — but it’s:

  • Token-intensive (expensive)
  • Harder to cite precisely
  • Less efficient at scale
  • More prone to hallucination if the doc isn’t 100% relevant

Your approach — with structured semantic chunking, pgvector, metadata filters, and explicit prompt control — is far more scalable, auditable, and reliable for professional GHG use cases.


:white_check_mark: Verdict

For the GHG Protocol Expert Chatbot, stick with your current architecture.

ReAG is interesting, but your users (GHG professionals) need:

  • Precision
  • Source citation
  • Predictable response behavior
  • Cost-efficiency

These are all areas where your architecture outperforms ReAG.


:wrench: Option: Hybrid Enhancement (If You Want the Best of Both)

You could borrow ReAG’s reasoning style selectively:

  • Run a follow-up prompt after retrieval: “Based on the following chunks, synthesize an answer with broader reasoning if needed.”
  • Use the retrieved metadata to re-rank with LLM assistance.
  • Build a “reasoning mode” for edge cases — but keep your default flow RAG-based.

@lucmachine The ReAG repo I linked is just one flavor (and more of a blueprint) of this approach. You could definitely (and should) use chunks instead of doing full document ingestion. As for the costs, yes, each API call will cost you, and it could become cost-prohibitive over time. The hybrid approach might be a good one if your current workflow yields results that are not accurate enough.

I’m not familiar with the GHG documents, but “human-like synthesis”, as the LLM calls it, can be helpful in those cases where vector similarity fails. The comparison the LLM provided is fair, but I’m not sure whether it has taken the nature (content, structure, etc.) of your corpus into account; that should determine the right approach. I quite like this approach:

Use the retrieved metadata to re-rank with LLM assistance.

Wouldn’t using OpenAI vectors simplify things? Any specific reason for not using OpenAI?

I think there is merit in the ReAG approach. I have been looking at the OpenAI vector store API, and it seems to me even OpenAI may have abandoned some of its planned RAG features in favor of ReAG.
For instance, the API provides ranking options, which can be used as follows:

```js
const ranking_options = { ranker: 'auto', score_threshold: 0.1 };
const res = await oai.vectorStores.search(vectorStoreId, {
  query,
  max_num_results: maxResults,
  ranking_options: ranking_options
});
```

and the expectation is that it returns only results above the threshold. On testing, this feature doesn’t appear to be implemented: the results contain matches both above and below the threshold, that is, all matches.

They also have a feature to filter results using the attributes on the file, which can refine the results further. I haven’t checked whether it works; I plan to do that today.

The OpenAI Assistant uses the file_search tool with the OpenAI vectors, and it works pretty well.
If you have used the Assistant, you will have seen that it returns a response with citations; there could be 1 to 3 citations, and I haven’t seen more than that. Now, if the vector search is returning up to 20 results but the response is only citing 1 to 3, the question is: where does this filtering happen? My guess is that OpenAI also uses ReAG for its file search. It probably passes the entire result set back to the LLM with instructions to pick only the relevant ones, something like below:

–  Below are SOURCE_CHUNKS to use, ONLY select relevant SOURCE_CHUNKS to answer.
– When you quote or paraphrase any SOURCE_CHUNK shown below,
  append its bracket id in square brackets.
– Only cite a chunk that directly supports the statement.
– Do not invent citations.

### SOURCES

1 Like

Thanks @jknt! I had a feeling ReAG was being utilized somewhere in the process currently. While trying o3, I did catch a glimpse of this approach when I uploaded a few files; the model thought, or rather said to itself, “let me read this part to make sure I understand it”. Now I’m not sure whether it actually went ahead and did so, or whether other models are also using this hybrid strategy. However, with token prices decreasing and (effective) context windows expanding, ReAG is looking promising for at least some use cases.

1 Like

My understanding is that you can use OpenAI’s prebuilt solution if you don’t care about precision. Actually, I might use this for an MVP, to show the world what can be done.

However, and I could be wrong, OpenAI’s RAG workflow is not very “mindful”, especially when it comes to having ‘clean and proper source documents’ and doing ‘mindful chunking’. Does this matter? Yes and no; you decide.

The way I see it, 90% of the world’s ChatBots will be “Meh”… The other 10% will be great because they are built with mindfulness and best practices.

You are right about OpenAI vectors not being mindful, because they use a fixed-size chunking strategy, which isn’t mindful of document structure. But, like you mentioned, it is a great starting point and may well become something to stick to, considering document structure and chunking strategies go hand in hand. So we can also optimise the documents themselves for better results.

My plan is to use OpenAI with multiple smaller vector stores for individual areas: for example, one vector store focused on the product’s architecture, with large chunk sizes, and a different vector store for FAQs, which may have smaller chunks, then tie these together using an agentic flow. I prefer this approach because it helps me move towards agentic flows rather than depend on large contexts, which are more expensive. A rough sketch of the vector store setup is below.
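
Something like this, assuming the OpenAI Python SDK (store names and chunk sizes are illustrative; depending on your SDK version the namespace may be client.beta.vector_stores):

```python
# Hypothetical setup: two vector stores with different static chunking settings,
# to be tied together later by an agentic flow. Names and sizes are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

architecture_store = client.vector_stores.create(name="product-architecture")
faq_store = client.vector_stores.create(name="product-faqs")

def attach_file(vector_store_id: str, file_id: str, max_tokens: int, overlap: int):
    # Chunking strategy is set per attached file: larger chunks for architecture
    # docs, smaller ones for FAQs.
    return client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=file_id,
        chunking_strategy={
            "type": "static",
            "static": {
                "max_chunk_size_tokens": max_tokens,
                "chunk_overlap_tokens": overlap,
            },
        },
    )

# e.g. attach_file(architecture_store.id, uploaded_doc.id, max_tokens=1600, overlap=200)
# e.g. attach_file(faq_store.id, uploaded_faq.id, max_tokens=400, overlap=50)
```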

1 Like

Ok, then, make sure the source documents are good.

People will keep creating ‘turnkey’ tools; can’t do anything about that. But mindful programmers can do better.

It only takes a few minutes to convert a PDF to .DOCX, then run a macro to remove and clean a few things, including isolating and using headers (sections and subsections) as metatags to help with splitting/chunking.
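
For illustration, a rough python-docx sketch of pulling those headers out as metadata (the heading style names are assumptions; real documents vary):

```python
# Rough sketch: walk a cleaned .docx and attach section/subsection headings to
# each block of text, so they can be used as metatags when splitting/chunking.
from docx import Document

def docx_to_tagged_blocks(path: str) -> list[dict]:
    doc = Document(path)
    blocks: list[dict] = []
    section, subsection, buffer = None, None, []

    def flush():
        if buffer:
            blocks.append({"section": section, "subsection": subsection,
                           "text": "\n".join(buffer)})
            buffer.clear()

    for para in doc.paragraphs:
        style = para.style.name or ""
        if style.startswith("Heading 1"):
            flush()
            section, subsection = para.text.strip(), None
        elif style.startswith("Heading 2"):
            flush()
            subsection = para.text.strip()
        elif para.text.strip():
            buffer.append(para.text.strip())
    flush()
    return blocks
```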

I suppose it depends on: 1) creating an App for people to use to create a ChatBot, or 2) creating a mindful ChatBot for a specialized niche.

90% of automated solutions will be… Meh. People upload a few PDF files, click a button, and get a ChatBot.

1 Like

You might want to check out this multimodal agentic RAG (with memory) that I open-sourced, to give you some ideas: github.com/pixeltable/pixelbot

Here’s my RAG:

```python
# Imports assumed from the Pixeltable / pixelbot examples; `config` is the
# project's own settings module (e.g., EMBEDDING_MODEL_ID).
import pixeltable as pxt
import pixeltable.functions.string as pxt_str
from pixeltable.functions.huggingface import sentence_transformer
from pixeltable.iterators import DocumentSplitter

import config

# === DOCUMENT PROCESSING ===
# Create a table to store uploaded documents.
# Pixeltable tables manage schema and efficiently store references to data.
documents = pxt.create_table(
    "agents.collection",
    {"document": pxt.Document, "uuid": pxt.String, "timestamp": pxt.Timestamp, "user_id": pxt.String},
    if_exists="ignore",
)
print("Created/Loaded 'agents.collection' table")

# Create a view to chunk documents using a Pixeltable Iterator.
# Views transform data on-demand without duplicating storage.
# Iterators like DocumentSplitter handle the generation of new rows (chunks).
chunks = pxt.create_view(
    "agents.chunks",
    documents,
    iterator=DocumentSplitter.create(
        document=documents.document,
        separators="paragraph",
        metadata="title, heading, page" # Include metadata from the document
    ),
    if_exists="ignore",
)

# Add an embedding index to the 'text' column of the chunks view.
# This enables fast semantic search using vector similarity.
chunks.add_embedding_index(
    "text",  # The column containing text to index
    string_embed=sentence_transformer.using( # Specify the embedding function
        model_id=config.EMBEDDING_MODEL_ID
    ),  # Use model from config
    if_exists="ignore",
)


# Define a reusable search query function using the @pxt.query decorator.
# This allows calling complex search logic easily from other parts of the application.
@pxt.query
def search_documents(query_text: str, user_id: str):
    # Calculate semantic similarity between the query and indexed text chunks.
    sim = chunks.text.similarity(query_text)
    # Use Pixeltable's fluent API (similar to SQL) to filter, order, and select results.
    return (
        chunks.where(
            (chunks.user_id == user_id)
            & (sim > 0.5)  # Filter by similarity threshold
            & (pxt_str.len(chunks.text) > 30) # Filter by minimum length
        )
        .order_by(sim, asc=False)
        .select(
            chunks.text,
            source_doc=chunks.document,  # Include reference to the original document
            sim=sim,
            title=chunks.title,
            heading=chunks.heading,
            page_number=chunks.page
        )
        .limit(20)
    )
```

etc...

Hi pbrunelle,

Thank you for taking the time to share your Pixeltable-based approach. It’s clear you’ve put thoughtful work into designing a modular and developer-friendly document pipeline, and I appreciate the clarity of your implementation.

After reviewing the structure more closely, I’m not sure I want to commit to a proprietary framework, especially one that relies on specialized abstractions for document formats and chunking. While Pixeltable is promising, it’s still a work in progress (looking at the pricing webpage), and I’d need greater transparency, control, and long-term flexibility before integrating it into our core systems.

At the heart of my concern is a structural reality I’ve encountered over and over again: PDFs are fundamentally presentation formats, not semantic ones. They lack true headers, structural tags, or consistent delimiters. What appears visually as a paragraph, section, or bullet list is often just a rendering artifact, and parsing them reliably requires more than delimiter-based chunking.

In my experience:

  • Line breaks are not sentence boundaries
  • Tables and lists often defy standard parsers
  • Bullet levels don’t translate to logical hierarchy
  • Word conversions often flatten essential structure

In short: semantic chunking can’t just mean splitting on paragraphs or estimating titles. It demands a layer of interpretation, an intelligent preprocessing stage that respects the actual architecture of meaning embedded in these documents.

Your system abstracts many valuable parts of the pipeline, but for my use case it’s critical to retain low-level control. I’m sure it can work well for many simpler use cases, especially with clean source documents.

Thanks again for your generosity and vision. I’ll keep an eye on how Pixeltable evolves, and I’m open to re-engaging down the line once things stabilize and open up further.

Warm regards,
Luc