Send CSV file for use in Chat Completion?

Is there a way to upload a CSV as potential input for a Chat Completion? I see that there is a file upload API (but apparently only for fine-tuning), yet the chat completion API doesn’t mention files.

GPT-4 told me this:

Unfortunately, it is not possible to give the ChatGPT API access to local files directly without uploading them over HTTP. The API runs on OpenAI’s servers and only has access to the data that is passed to it through its API endpoints.

However, if you have a large amount of data that you need to use with the ChatGPT API, you may want to consider using OpenAI’s Data API. The Data API allows you to upload your data to OpenAI’s servers and then use it with their language models, including ChatGPT.

To use the Data API, you would first need to upload your data to OpenAI’s servers using the POST /v1/data endpoint. Once your data is uploaded, you can use it with the ChatGPT API by specifying the data parameter in your API request.

Is this a hallucination, or an undocumented property on chat completion requests?

Hi @oliver.drobnik

Welcome to the community.

This is indeed hallucination/confabulation.

1 Like

Ok, but what about my first question: is there a way to facilitate this flow of data?

If you read the docs, you’ll see that you don’t need to upload the CSV; instead, you break it into chunks and obtain embeddings.

Here’s the boilerplate code given in the docs:

import openai

def get_embedding(text, model="text-embedding-ada-002"):
    # Collapse newlines so each chunk is embedded as a single line of text.
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

# df is assumed to be a pandas DataFrame with a "combined" text column.
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

Hello @oliver.drobnik
Yes you can, and it works very well. Here are the steps you can take.

  1. Parse the CSV file and, for every row, convert the row into one key-value sentence. For example
    Name, Product, Review
    Oliver, SD Card, I think this SD Card is great
    Nelson, SD Card, This SD Card is horrible
    will turn into the following lines…
    Name: Oliver, Product: SD Card, Review: I think this SD Card is great
    Name: Nelson, Product: SD Card, Review: This SD Card is horrible

  2. Turn every line into a separate embedding
    Encode(Name: Oliver, Product: SD Card, Review: I think this SD Card is great) => [223,231,132]
    Encode(Name: Nelson, Product: SD Card, Review: This SD Card is horrible) => [323,351,112]

  3. When a user asks a question, search for the most similar embeddings and include them in the prompt (see the sketch after this list).
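To make those three steps concrete, here is a rough Python sketch. It assumes the 0.x openai SDK (as in the boilerplate earlier in this thread), a hypothetical reviews.csv with the columns shown above, and a plain in-memory cosine similarity instead of a vector database:

import csv
import numpy as np
import openai  # 0.x SDK; an API key must already be configured

def embed(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

# Step 1: parse the CSV and turn every row into one key-value sentence.
with open("reviews.csv", newline="") as f:  # hypothetical file name
    lines = [", ".join(f"{k}: {v}" for k, v in row.items()) for row in csv.DictReader(f)]

# Step 2: one embedding per line.
vectors = [embed(line) for line in lines]

# Step 3: embed the question, pick the most similar lines, and add them to the prompt.
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec = embed("What do people think of the SD Card?")
top = sorted(zip(lines, vectors), key=lambda lv: cosine(q_vec, lv[1]), reverse=True)[:3]
context = "\n".join(line for line, _ in top)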

Here is a screenshot of how it can work (screenshots: Reference Data, Chat Examples).

If you’d like to see how it works in action before you implement it, you can try it at https://www.superinsight.ai

Hope that helps!

2 Likes

Please forgive my ignorance. Also, I’d like to understand the steps at the lowest level using only the JSON-REST API.

So it seems to me that I should be able to change my CSV into embeddings. That’s done via an OpenAI embedding model, right? Like https://platform.openai.com/docs/guides/embeddings/what-are-embeddings.

As I understand it, I have two options here. The first: I should be able to give a list of embeddings as part of the messages in a chat completion request, right?

And if the list of embeddings is too big, then I would have to do it locally: I would put all the embeddings into a local vector database, turn the query itself into an embedding via the OpenAI API, do a local search for similar vectors in my database, and then send those results via a chat completion request to OpenAI to answer, right?

I don’t see any way to send embedding vectors to the API. Would I have to save the source text together with the embedding vectors and then send the most relevant texts as part of the chat messages?

Why use embeddings
The reason to use embeddings is to find the most relevant context before you feed it into your prompt, and to manage prompt tokens. If this is not what you need, then you probably don’t need embeddings.

What data to encode into embeddings
You need to decide how to segment your data. To get good search results, separate your content based on relevance; for example, a short paragraph of 300-500 tokens is a good chunk size. For a CSV, you might also want to separate by rows if you don’t want to mix rows in the same embedding. It really depends on your specific use case and the size of the text. To count your tokens, you can use the openai/tiktoken package on GitHub (a fast BPE tokeniser for use with OpenAI’s models).
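As a small illustration of the token counting mentioned above (this assumes the tiktoken package is installed; the sample text is just the row from the earlier example):

import tiktoken

def num_tokens(text, model="gpt-3.5-turbo"):
    # Count tokens with the same BPE encoding the model uses.
    return len(tiktoken.encoding_for_model(model).encode(text))

# Aim for chunks roughly in the 300-500 token range suggested above.
print(num_tokens("Name: Oliver, Product: SD Card, Review: I think this SD Card is great"))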

Creating and storing the embeddings
You can use the OpenAI API or any other transformer model from the open-source community. And yes, you can store embeddings in a vector database so you can perform semantic search. Regardless of which model you use, make sure you use the same model for all your embeddings.

Feeding it back to the prompt
Most vector databases only manage the embeddings for you, so when you perform a semantic search you will most likely get back an ID and the embedding with a score. You will need to map this ID to another database to retrieve the actual text. Once you have all your text, sort it by relevance and count your tokens again to make sure you don’t go over the limit in your prompt; a sketch of that step follows below.
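A minimal sketch of that last step, assuming the vector database returned (id, score) pairs and that texts is your own id-to-text lookup (both names are hypothetical):

import tiktoken

def build_context(matches, texts, max_tokens=3000, model="gpt-3.5-turbo"):
    # matches: list of (id, score) from the vector database; texts: id -> original chunk.
    enc = tiktoken.encoding_for_model(model)
    selected, used = [], 0
    for match_id, score in sorted(matches, key=lambda m: m[1], reverse=True):
        cost = len(enc.encode(texts[match_id]))
        if used + cost > max_tokens:
            break  # stop before the context pushes the prompt over the limit
        selected.append(texts[match_id])
        used += cost
    return "\n".join(selected)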

Hope that answers your question, good luck!

1 Like

Thank you for the extensive reply. It gives me a lot to think about for my specific use case.

Basically the core message is: I need to provide sufficient relevant facts together with my query for GPT to do its magic. If it is about text then embeddings and vector databases seem to be what most people use. I am beginning to understand why that is.

My goal is to have the data from my credit cards and bank accounts in one place from which I can pull data to build a budget, possibly employing GPT to aid in the process, e.g. categorising transactions or coming up with good categories based on them.

I generally avoid this extra lookup, as it may add latency to the process. Instead, I store the actual text (and other metadata) with the vectors, depending on the data model of the solution. Pinecone supports this as well.
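As a rough sketch of that pattern (this assumes the pinecone-client 2.x interface; the index name, IDs, and metadata fields are hypothetical, and the vectors are placeholders for real embeddings):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("transactions")  # hypothetical index name

row_text = "2023-07-01, SUPERMARKET, -54.20"  # hypothetical transaction line
embedding = [0.1] * 1536                      # placeholder for a real ada-002 vector
query_embedding = [0.1] * 1536                # placeholder for the embedded question

# Store the original text (and any other metadata) right next to the vector.
index.upsert(vectors=[("txn-1", embedding, {"text": row_text, "amount": -54.20})])

# A query can then return the text directly, with no second lookup.
result = index.query(vector=query_embedding, top_k=3, include_metadata=True)
for match in result["matches"]:
    print(match["score"], match["metadata"]["text"])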

Thanks for the added input Bill.
Cheers!

Hi Oliver,

You are correct: most of the techniques use embeddings and a vector database to do searches.
Another important technique is the art of prompt engineering; if you can find creative ways to work it into your workflow, it will make a big difference. Wish you the best of luck in your AI journey. Cheers.

I have a question here. Can we use this technique on quantitative data?

Why use embeddings for quantitative data when a DB can get the job done?

IMO, embeddings are primarily used for data types that have semantic meaning, such as text, where the goal is to capture and represent the relationships and similarities between different elements. Embeddings excel at capturing the contextual and semantic information inherent in such data.

Ok, so the context is: I am working on a use case where my input is a CSV file. I tried using the LangChain agent framework with gpt-3.5-turbo, but unfortunately I’m not getting consistent results, or I’m getting wrong results, and most of the time it gives me agent iteration limit or time limit issues.
I tried setting max_iterations etc., but still the same.
I have also read in a couple of places that GPT-4 is necessary when using agents; gpt-3.5-turbo makes many mistakes.

So I was wondering how I can deal with this while still using gpt-3.5-turbo. Embeddings work effectively for similarity search and work best on text, so it would be helpful to get some advice on how to deal with these issues.

Thanks

I have seen that the legacy ‘/completions’ API supported adding an array of “choices” within the free-form “prompt”.
I’d like to see an example that uses the newer ‘/chat/completions’ API.

Update: The example prompt below seems to work. Is there a better way?
prompt: “Where do elephants live? restrict to the following knowledge:”+ mySemanticSearchReducedText;

Response:

{
  "id": "chatcmpl-myid",
  "object": "chat.completion",
  "created": 1688942778,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on the provided knowledge, elephants do not have a specific habitat mentioned."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 93,
    "completion_tokens": 15,
    "total_tokens": 108
  }
}

Welcome @mandalorian

This is one of the many ways.

Alternatively, you can specify a system message to pass the knowledge/context.
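For instance, here is a minimal sketch using the 0.x Python SDK; the knowledge string stands in for whatever your semantic search returned (mySemanticSearchReducedText in the example above):

import openai  # 0.x SDK, matching the earlier examples

knowledge = "Elephants live in savannas, grasslands and forests in Africa and Asia."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system",
         "content": "Answer only from the CONTEXT INFORMATION that follows:\n" + knowledge},
        {"role": "user", "content": "Where do elephants live?"},
    ],
)
print(response["choices"][0]["message"]["content"])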

1 Like

This example and my prompt that worked yesterday did not work today.
Found another prompt that worked: "Don’t justify your answers. Don’t give information not mentioned in the CONTEXT INFORMATION that follows: "+knowledge

Yes, I agree.
The older model gives you a lot more flexibility on the prompt, but I think your prompt prefix is pretty good already. What I am seeing that makes the biggest difference now is the ability to create the highest-quality mySemanticSearchReducedText, so that’s where I spend most of my effort refining.

So if I have a CSV file of agricultural data, I don’t need to use embeddings, just turn it into a string to use the API?