Is there a way to upload a CSV as potential input for a Chat Completion? I see that there is a file upload API (but apparently only for fine-tuning), but the chat completion API doesn’t mention files.
GPT-4 told me this:
Unfortunately, it is not possible to give the ChatGPT API access to local files directly without uploading them over HTTP. The API runs on OpenAI’s servers and only has access to the data that is passed to it through its API endpoints.
However, if you have a large amount of data that you need to use with the ChatGPT API, you may want to consider using OpenAI’s Data API. The Data API allows you to upload your data to OpenAI’s servers and then use it with their language models, including ChatGPT.
To use the Data API, you would first need to upload your data to OpenAI’s servers using the POST /v1/data endpoint. Once your data is uploaded, you can use it with the ChatGPT API by specifying the data parameter in your API request.
Is this a hallucination, or an undocumented property on chat completion requests?
Yes you can, and it works very well. Here are the steps you can take.
Parse the CSV file and, for every row, rewrite the row as one line of key-value sentence. For example:

Name, Product, Review
Oliver, SD Card, I think this SD Card is great
Nelson, SD Card, This SD Card is horrible

will turn into multiple lines of the following:

Name: Oliver, Product: SD Card, Review: I think this SD Card is great
Name: Nelson, Product: SD Card, Review: This SD Card is horrible

For every line, turn it into a separate embedding:

Encode(Name: Oliver, Product: SD Card, Review: I think this SD Card is great) => [223, 231, 132]
Encode(Name: Nelson, Product: SD Card, Review: This SD Card is horrible) => [323, 351, 112]
When a user asks a question, search for the most similar embeddings and include the corresponding text in the prompt.
Here is a screenshot of how it can work.
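The row-to-sentence step can be sketched in Python using only the standard library (the sample CSV and the function name are just for illustration):

```python
import csv
import io

def rows_to_sentences(csv_text: str) -> list[str]:
    """Rewrite each CSV row as one 'key: value, key: value, ...' sentence."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ", ".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())
        for row in reader
    ]

csv_text = """Name,Product,Review
Oliver,SD Card,I think this SD Card is great
Nelson,SD Card,This SD Card is horrible"""

for sentence in rows_to_sentences(csv_text):
    print(sentence)
# Name: Oliver, Product: SD Card, Review: I think this SD Card is great
# Name: Nelson, Product: SD Card, Review: This SD Card is horrible
```

Each printed line is what you would then pass to the embeddings endpoint, one embedding per row.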
As I understand it, I now have two options. I should be able to give a list of embeddings as part of the messages in a chat completion request, right?
And if the list of embeddings is too big, then I would have to do it locally: I would put all the embeddings into a local vector database, turn the query itself into an embedding via the OpenAI API, do a local search for similar vectors in my database, and then send those results via a chat completion request to OpenAI to answer, right?
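The local-search part of that flow can be sketched without any external dependencies (the store layout and function names here are my own illustration, not a specific library's API; in the real flow the query vector would come from the embeddings endpoint rather than being hand-written):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec: list[float], store, k: int = 3) -> list[str]:
    """store: list of (text, vector) pairs kept locally alongside each embedding.

    Returns the k texts whose vectors are closest to the query vector;
    these texts (not the vectors) are what gets pasted into the chat prompt.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:k]]
```

A dedicated vector database does the same ranking, just at scale and with indexing; for a few thousand rows a brute-force loop like this is often enough.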
I don’t see any way to send embedding vectors to the API. Would I have to save the source text together with the embedding vectors and then send the most relevant texts as part of the chat messages?
Why use embeddings
The reason you want to use embeddings is to find the most relevant context before you feed it into your prompt, and also to manage prompt tokens. If this is not what you need, then you probably don’t need embeddings.
What data to encode into embeddings
You need to decide how to segment your data. To get good search results, you should separate your content based on relevance. For example, a short paragraph of 300-500 tokens is a good chunk size. For CSV, you might also want to separate by rows if you don’t want to mix rows in the same embedding. It really depends on your specific use case and the size of the text. To count your tokens, you can use this package: GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models.
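A minimal sketch of packing rows into chunks under a token budget. The word-count tokenizer here is a crude stand-in so the example stays self-contained; for exact counts, swap in tiktoken as linked above:

```python
def chunk_rows(rows: list[str], max_tokens: int = 400,
               count_tokens=lambda s: len(s.split())) -> list[str]:
    """Greedily pack rows into chunks that stay under max_tokens each.

    count_tokens defaults to a rough word-count proxy. For exact counts use
    tiktoken, e.g.:
        enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
        count_tokens = lambda s: len(enc.encode(s))
    """
    chunks, current, used = [], [], 0
    for row in rows:
        n = count_tokens(row)
        if current and used + n > max_tokens:
            # Current chunk is full; start a new one.
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(row)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk then gets its own embedding, so a search result brings back a coherent, size-bounded piece of context.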
Creating and storing the embedding
You can use the OpenAI API or any other transformer model from the open-source community. And yes, you can store embeddings in a vector database so you can perform semantic search. Regardless of which model you use, make sure you use the same model for all your embeddings.
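The "same model for all embeddings" rule is worth enforcing in code, since vectors from different models live in different spaces and their similarity scores are meaningless across models. A tiny in-memory stand-in for a vector store (the class and field names are illustrative, not any particular database's API):

```python
class EmbeddingStore:
    """Minimal in-memory stand-in for a vector database.

    Records which embedding model produced the stored vectors and rejects
    vectors from a different model, because similarity is only meaningful
    within a single embedding space.
    """

    def __init__(self):
        self.model = None          # set on first insert
        self.items = {}            # id -> (text, vector)

    def add(self, item_id, text: str, vector: list[float], model: str):
        if self.model is None:
            self.model = model
        elif model != self.model:
            raise ValueError(
                f"store was built with {self.model!r}, got vector from {model!r}"
            )
        self.items[item_id] = (text, vector)
```

The vectors themselves would come from your embedding model of choice; with the 2023-era OpenAI Python client that call was, for example, `openai.Embedding.create(model="text-embedding-ada-002", input=texts)`.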
Feeding it back to the prompt
Most vector databases will only manage the embeddings for you, so when you perform a semantic search, you will most likely get back an ID and the embeddings with scores. You will need to map this ID to another database to retrieve the actual text. Once you have all your text, you might want to sort it by relevance, count your tokens again, and make sure you don’t go over the limit in your prompt.
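That last step — map IDs back to text, sort by relevance, and stop before the token budget is exceeded — can be sketched like this (function and variable names are illustrative; the word-count tokenizer is a rough proxy for tiktoken):

```python
def build_context(matches: list[tuple], texts_by_id: dict, max_tokens: int,
                  count_tokens=lambda s: len(s.split())) -> str:
    """matches: (id, score) pairs as returned by a vector-DB search.
    texts_by_id: lookup from stored ID back to the original text.

    Takes matches in descending score order and appends each text until
    adding the next one would exceed max_tokens.
    """
    context, used = [], 0
    for item_id, _score in sorted(matches, key=lambda m: m[1], reverse=True):
        text = texts_by_id[item_id]
        n = count_tokens(text)
        if used + n > max_tokens:
            break
        context.append(text)
        used += n
    return "\n".join(context)
```

The returned string is what gets pasted into the prompt as context; anything that didn't fit the budget is simply dropped, least relevant first.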
thank you for the extensive reply. Gives me a lot to think about for my specific use case.
Basically the core message is: I need to provide sufficient relevant facts together with my query for GPT to do its magic. If it is about text then embeddings and vector databases seem to be what most people use. I am beginning to understand why that is.
My goal is to have the data from my credit cards and bank accounts in one place from which I can pull data to build a budget from, possibly employing GPT to aid in the process, like categorising or coming up with good categories based on the transactions.
I generally avoid this as it may add latency to the process. Instead, I store the actual text (and other meta-data) with the vectors depending on the data model of the solution. Pinecone supports this as well.
You are correct, most of the techniques are using embeddings and vector database to do searches.
Another important technique is the art of prompt engineering, if you can find creative ways to use this into your workflow, it will make a lot of difference. Wish you the best of luck in your AI journey. Cheers.
Why use embeddings for quantitative data when a DB can get the job done?
IMO, embeddings are primarily used for data types that have semantic meaning, such as text, where the goal is to capture and represent the relationships and similarities between different elements. Embeddings excel at capturing the contextual and semantic information inherent in such data.
OK, so for context: I am working on a use case where my input is a CSV file. I tried using the LangChain agent framework with GPT-3.5 Turbo, but unfortunately I’m not getting consistent results, or I’m getting wrong results, and most of the time it gives me agent iteration-limit or time-limit issues.
I tried setting max_iterations etc., but it’s still the same.
I also read in a couple of places that GPT-4 is necessary for agents; GPT-3.5 Turbo makes many mistakes.
So I was wondering how I can deal with this while still using GPT-3.5 Turbo. Embeddings work effectively for similarity search and work best on text, so it would be helpful to get some advice on how to deal with these issues.
This example and my prompt that worked yesterday did not work today.
Found another prompt that worked: "Don’t justify your answers. Don’t give information not mentioned in the CONTEXT INFORMATION that follows: "+knowledge
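One way to wire a prefix like that into a chat completion request (a sketch; the function name is mine, and `knowledge` stands for the retrieved context text):

```python
def build_messages(knowledge: str, question: str) -> list[dict]:
    """Put the restrictive instruction plus retrieved context into the system
    message, and the user's question into the user message."""
    system = (
        "Don't justify your answers. Don't give information not mentioned "
        "in the CONTEXT INFORMATION that follows: " + knowledge
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# The messages list would then be passed to the chat completions endpoint,
# e.g. openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
# with the 2023-era client.
```

Keeping the instruction and the context in the system message leaves the user turn clean, which tends to make the restriction stick better across follow-up questions.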
Yes, I agree.
The older model gives you a lot more flexibility on the prompt, but I think your prompt prefix is pretty good already. What I am seeing that makes the most differences now is the ability to create the highest quality of mySemanticSearchReducedText. So that’s where I spend most of my effort refining.