I have tried to implement something like this:
Technical Implementation
The system is designed to ingest, index, search, and retrieve Python code snippets efficiently while providing a chatbot interface to assist users in understanding and exploring code. Let’s break it down into key components:
1. Code Ingestion and Indexing
To enable efficient code search, we extract functions and classes from Python files and store their metadata in a database.
1.1 Extracting Function/Class Names, Code, and Docstrings
We use Python’s built-in ast module to parse files and extract:
Function/Class Names
Code Body
Docstrings (Documentation within functions/classes)
Example Extraction Logic
import ast

def extract_functions_from_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    tree = ast.parse(content)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            name = node.name
            start_line = node.lineno
            # end_lineno (Python 3.8+) covers the whole definition; node.body[-1].lineno
            # only gives the *start* of the last statement and truncates multi-line ones.
            end_line = node.end_lineno
            function_code = "\n".join(content.splitlines()[start_line - 1:end_line])
            docstring = ast.get_docstring(node) or ""
            functions.append((name, function_code, docstring))
    return functions
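For example, running the extractor over a file prints each definition with the first line of its docstring (a quick sketch; example.py is a placeholder path):
for name, code, doc in extract_functions_from_file("example.py"):
    # Show each extracted definition and a one-line summary of its docstring
    print(f"{name}: {doc.splitlines()[0] if doc else '(no docstring)'}")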
Why store file path, docstring, chunk id?
file_path: Helps retrieve the original file where a function/class is defined.
docstring: Enables documentation-based search for better code understanding.
chunk_id: A unique identifier for each function (combines file name + function name).
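For instance, each extracted function can be packaged into a record like the one below before storage (a sketch; the chunk_id format shown is illustrative, not a fixed scheme):
import os

def build_record(file_path, name, code, docstring):
    file_name = os.path.basename(file_path)
    return {
        "chunk_id": f"{file_name}::{name}",  # unique per file + function/class name
        "file_path": file_path,
        "title": name,
        "content": code,
        "docstring": docstring,
    }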
1.2 Storing and Indexing Code with Embeddings
Once we extract function details, we store them in a database and generate embeddings using OpenAI’s text-embedding-ada-002 model.
Embedding Generation
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `code` and `docstring` come from the extraction step above
embedding_response = client.embeddings.create(
    input=[code + " " + docstring],
    model="text-embedding-ada-002"
)
embedding = embedding_response.data[0].embedding
Why embeddings?
Traditional keyword-based search fails when looking for semantically similar code. By converting text into vector embeddings, we can perform similarity-based retrieval.
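Storing the vectors next to the metadata might look like this (a minimal sketch assuming a document table with a pgvector embedding column sized for ada-002's 1,536 dimensions):
from django.db import connection

def store_document(record, embedding):
    # pgvector accepts a '[x,y,...]' literal cast to ::vector
    embedding_literal = f"[{','.join(map(str, embedding))}]"
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO document (title, content, docstring, file_path, embedding)
            VALUES (%s, %s, %s, %s, %s::vector);
            """,
            [record["title"], record["content"], record["docstring"],
             record["file_path"], embedding_literal],
        )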
Word2Vec for Code Understanding
We train a Word2Vec model on function names, docstrings, and code tokens.
from gensim.models import Word2Vec

# `tokens` is a list of token lists built from function names, docstrings, and code
w2v_model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)
Why Word2Vec?
Helps in synonym expansion (e.g., search_function vs. lookup_method).
Enables suggestions for similar function names.
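Once trained, related identifiers can be suggested via nearest neighbours (a sketch; assumes "search" occurs in the training corpus, and the output shown is illustrative):
# Nearest neighbours in the embedding space, used for synonym expansion
similar = w2v_model.wv.most_similar("search", topn=5)
print(similar)  # e.g. [("lookup", 0.81), ("find", 0.78), ...]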
2. Query Processing
To ensure efficient search and retrieval, we process user queries before searching.
2.1 Query Preprocessing (preprocess_query)
import re
import spacy

nlp = spacy.load("en_core_web_md")

# Project-specific synonym map; illustrative values shown here
SYNONYM_DICT = {"search": ["lookup", "find"], "method": ["function"]}

def preprocess_query(query):
    query = query.lower()
    query = re.sub(r"[^a-z0-9\s]", "", query)  # remove special characters
    doc = nlp(query)
    words = {token.lemma_ for token in doc
             if token.pos_ in {"NOUN", "VERB", "PROPN"} and not token.is_stop}
    # Synonym expansion
    words.update({syn for word in words if word in SYNONYM_DICT
                  for syn in SYNONYM_DICT[word]})
    return " ".join(words)
Why NLP & Lemmatization?
Removes stop words (the, is, a, etc.).
Keeps only meaningful lemmas; the result is an unordered bag of words (find me a similar function → find function similar).
Expands synonyms for better recall.
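Continuing from the snippet above (output order varies because a set is used, and the synonyms depend on SYNONYM_DICT):
print(preprocess_query("Find a similar search method!"))
# e.g. "find search method lookup function"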
3. Code Search Mechanism
3.1 Vector Similarity Search
Once a query is processed, we generate query embeddings and retrieve similar documents.
Query Embedding
query_embedding = client.embeddings.create(
    input=query, model="text-embedding-ada-002"
).data[0].embedding
Vector Search (PostgreSQL with pgvector)
from django.db import connection

def search_similar_documents(query_embedding, top_k=3):
    # pgvector expects a '[x,y,...]' literal cast to ::vector
    embedding_array = f"[{','.join(map(str, query_embedding))}]"
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT id, title, content, docstring, file_path,
                   embedding <=> %s::vector AS distance
            FROM document
            ORDER BY distance ASC
            LIMIT %s;
            """,
            [embedding_array, top_k],
        )
        results = cursor.fetchall()
    return results
Why pgvector?
Performs fast vector similarity searches directly inside PostgreSQL.
The <=> operator computes cosine distance (not similarity), so ordering ascending surfaces the most relevant results first.
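To keep these scans fast as the table grows, an approximate-nearest-neighbour index can be added once (a sketch; assumes the pgvector extension is installed):
with connection.cursor() as cursor:
    # IVFFlat index using cosine distance, matching the <=> operator above
    cursor.execute(
        """
        CREATE INDEX IF NOT EXISTS document_embedding_idx
        ON document USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);
        """
    )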
3.2 Keyword-Based Search (Trigram Similarity)
To complement vector search, we use PostgreSQL's trigram matching via Django's TrigramSimilarity (which requires the pg_trgm extension).
from django.contrib.postgres.search import TrigramSimilarity

keyword_results = Document.objects.annotate(
    similarity=TrigramSimilarity("title", query) + TrigramSimilarity("docstring", query)
).filter(similarity__gt=0.3).order_by("-similarity")[:3]
Why Trigram Similarity?
Helps when the query is misspelled (serch function → search function).
Matches partial words (find meth → find_method).
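The two result sets can then be merged and de-duplicated before prompting the model (a sketch; the tuple layout mirrors the SELECT in search_similar_documents, and the merge policy is illustrative):
def merge_results(vector_results, keyword_results, limit=3):
    # Prefer vector hits (already ordered by distance), then fill with keyword hits
    seen, merged = set(), []
    for row in vector_results:  # (id, title, content, docstring, file_path, distance)
        if row[0] not in seen:
            seen.add(row[0])
            merged.append(row)
    for doc in keyword_results:  # Document instances from the trigram query
        if doc.id not in seen:
            seen.add(doc.id)
            merged.append((doc.id, doc.title, doc.content, doc.docstring, doc.file_path, None))
    return merged[:limit]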
4. Chat Session Management
We store chat sessions so users can interact with the system over time.
4.1 Rate Limiting and Session Handling
from datetime import timedelta
from django.utils.timezone import now
from rest_framework.response import Response

MAX_QUERIES_PER_HOUR = 100

# Inside the chat API view:
one_hour_ago = now() - timedelta(hours=1)
user_message_count = Message.objects.filter(
    chat_session=chat_session, role="user", created_at__gte=one_hour_ago
).count()
if user_message_count >= MAX_QUERIES_PER_HOUR:
    return Response({"error": "Query limit reached. Try again in an hour."}, status=429)
Why rate limiting?
Prevents abuse/spam.
Keeps OpenAI API costs under control, since every query triggers API calls.
4.2 Summarizing Older Chats (Using tiktoken)
To prevent exceeding OpenAI’s token limits, we summarize old chats.
import tiktoken

TOKEN_LIMIT = 3000

encoding = tiktoken.encoding_for_model("gpt-4o-mini")
total_tokens = sum(len(encoding.encode(msg["content"])) for msg in chat_history)

if total_tokens > TOKEN_LIMIT:
    summary_prompt = f"Summarize the chat:\n\n{chat_history}"
    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": summary_prompt}],
        max_tokens=300
    )
    summary = summary_response.choices[0].message.content.strip()
    Message.objects.create(chat_session=chat_session, role="system", content=summary)
Why tiktoken?
Calculates the exact token count before sending a request.
Prevents exceeding the model's context window and keeps us under our own TOKEN_LIMIT budget.
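After the summary is stored, the in-memory history can be compacted so the next request stays within budget (a sketch; compact_history and keep_last are illustrative, not part of the codebase above):
def compact_history(chat_history, summary, keep_last=6):
    # Replace older turns with the summary, keeping only the most recent messages
    recent = chat_history[-keep_last:]
    return [{"role": "system", "content": f"Summary of earlier chat: {summary}"}] + recent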
5. AI Chat Completion
Finally, after retrieving relevant code snippets, we generate an AI response.
completion_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=chat_history,
    max_tokens=500
)
answer = completion_response.choices[0].message.content.strip()
Why GPT-based completion?
Generates contextual answers based on retrieved code.
Supports natural language explanations.
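For completeness, here is one way the retrieved snippets could be injected into chat_history before the completion call (a sketch; build_chat_history and the prompt wording are illustrative):
def build_chat_history(previous_messages, retrieved, user_query):
    # Concatenate retrieved snippets into a context block for the model
    context = "\n\n".join(
        f"File: {file_path}\n{content}"
        for (_id, _title, content, _docstring, file_path, _dist) in retrieved
    )
    system_msg = {
        "role": "system",
        "content": "You are a code assistant. Answer using the code snippets below.\n\n" + context,
    }
    return [system_msg] + previous_messages + [{"role": "user", "content": user_query}]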