#### Introduction
Hi @jr.2509. I've run my first test and it works quite well.
I converted a markdown text file containing my knowledge into CSV embeddings (as a test I started with 6,000 tokens) and then incorporated them into a context window.
Thanks to your explanation, I've tested the model with various questions and the responses are correct.
Here is my process.
#### Markdown text file into embeddings (CSV file)
- Reads the text from the specified file.
- Splits the text into chunks.
- Generates embeddings for each chunk.
- Saves the chunks and embeddings to a CSV file.
```python
import openai
import os
import pandas as pd
import tiktoken

# Set up API key
# NOTE: this uses the pre-1.0 openai SDK (openai.Embedding / openai.ChatCompletion)
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-proj-12345...")

# Function to read the text file
def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Function to chunk text into manageable parts of at most max_tokens tokens
def chunk_text(text, max_tokens=2000, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk))
    return chunks

# Function to generate embeddings for each chunk
def generate_embeddings(text_chunks, model="text-embedding-3-large"):
    embeddings = []
    for chunk in text_chunks:
        response = openai.Embedding.create(
            input=chunk,
            model=model
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings

# Function to save the chunks and their embeddings to a CSV file
def save_embeddings_to_csv(chunks, embeddings, file_name="embeddings.csv"):
    df = pd.DataFrame({
        "text": chunks,
        "embedding": embeddings
    })
    df.to_csv(file_name, index=False)

# Main execution
file_path = "(1) Il terreno di gioco - REGOLA 1.txt"
text = read_text_file(file_path)
chunks = chunk_text(text, max_tokens=2000, model="gpt-4o")  # Adjust max_tokens as needed
embeddings = generate_embeddings(chunks, model="text-embedding-3-large")
save_embeddings_to_csv(chunks, embeddings)
```
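As a quick sanity check of the output (assuming the default `embeddings.csv` file name), I reload the CSV and parse one embedding back. Note that `to_csv` stores each embedding list as a plain string:

```python
import ast
import pandas as pd

df = pd.read_csv("embeddings.csv")
print(len(df), "chunks saved")

# to_csv() serializes each embedding list as a string, so parse it back
first_embedding = ast.literal_eval(df["embedding"][0])
print(len(first_embedding), "dimensions")  # text-embedding-3-large returns 3072 dimensions
```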
#### Incorporating embeddings into a context window
- Defines an example question in Italian.
- Uses the user’s question as the query to find relevant text chunks based on cosine similarity of embeddings.
- Constructs a context window from the relevant text chunks.
- Queries GPT-4o with the user’s question and the constructed context to get an answer in Italian.
- Prints the answer.
```python
import ast
import openai
import os
import pandas as pd
import tiktoken
from sklearn.metrics.pairwise import cosine_similarity

# Set up API key (pre-1.0 openai SDK, as above)
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-proj-12345...")

# Load embeddings from the CSV file
df = pd.read_csv('embeddings-regola-1.csv')

# Function to get the embedding for a query
def get_query_embedding(query, model="text-embedding-3-large"):
    response = openai.Embedding.create(
        input=query,
        model=model
    )
    return response['data'][0]['embedding']

# Function to find the most relevant chunks by cosine similarity
def find_relevant_chunks(query, df, top_n=5):
    query_embedding = get_query_embedding(query)
    # Embeddings are stored as strings in the CSV; ast.literal_eval is safer than eval
    df['similarity'] = df['embedding'].apply(
        lambda x: cosine_similarity([query_embedding], [ast.literal_eval(x)])[0][0]
    )
    return df.nlargest(top_n, 'similarity')

# Function to create a context window from the retrieved chunks
def create_context_window(chunks):
    context_window = "\n\n".join(chunks['text'].tolist())
    return context_window

# Function to ask GPT-4o a question using the context window and get a response in Italian
def ask_gpt(question, context, model="gpt-4o"):
    prompt = f"Context:\n{context}\n\nDomanda: {question}\nRisposta:"  # "Question:" / "Answer:"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Rispondi alle domande in italiano."},  # "Answer questions in Italian."
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.7
    )
    return response['choices'][0]['message']['content']

# Example usage
# Example question in Italian: "What colour must artificial surfaces be?"
question = "Di che colore devono essere le superfici artificiali?"

# Use the user's question as the query
relevant_chunks = find_relevant_chunks(question, df)
context_window = create_context_window(relevant_chunks)

# Get the answer from GPT-4o in Italian
answer = ask_gpt(question, context_window)
print(answer)
```
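Since I'm concerned about the context window growing, a rough check like this (reusing tiktoken, already imported above) counts the tokens of the assembled context before each request:

```python
# Rough cost check: how many tokens does the assembled context use?
encoding = tiktoken.encoding_for_model("gpt-4o")
context_tokens = len(encoding.encode(context_window))
print(f"Context window: {context_tokens} tokens")
# With top_n=5 chunks of up to 2,000 tokens each, this can approach 10,000 tokens per query
```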
#### Conclusion
- In your opinion, what should be improved in my process? Do you see anything wrong?
- I'm still missing one thing. With this process I haven't "injected knowledge into gpt-4o"; I've just told gpt-4o, in effect, "look at this knowledge before responding", right?
- Now, what would be my next step to actually inject this new knowledge into gpt-4o? I'm concerned about the context window growing and becoming too costly. Can you help me address this issue?